Preface


Edward Davenant said he “would have a man knockt in the head that should 
write anything in Mathematiques that had been written of before.” So 
reports John Aubrey in his Brief Lives. What is new here then? 

To introduce the idea of measure the book opens with Borel’s normal 
number theorem, proved by calculus alone, and there follow short sections 
establishing the existence and fundamental properties of probability mea- 
sures, including Lebesgue measure on the unit interval. For simple random 
variables—ones with finite range—the expected value is a sum instead of an 
integral. Measure theory, without integration, therefore suffices for a com- 
pletely rigorous study of infinite sequences of simple random variables, and 
this is carried out in the remainder of Chapter 1, which treats laws of large 
numbers, the optimality of bold play in gambling, Markov chains, large 
deviations, the law of the iterated logarithm. These developments in their 
turn motivate the general theory of measure and integration in Chapters 2 
and 3. 

Measure and integral are used together in Chapters 4 and 5 for the study 
of random sums, the Poisson process, convergence of measures, characteristic 
functions, central limit theory. Chapter 6 begins with derivatives according to
Lebesgue and Radon-Nikodym, a return to measure theory, then applies
them to conditional expected values and martingales. Chapter 7 treats such
topics in the theory of stochastic processes as Kolmogorov’s existence theo- 
rem and separability, all illustrated by Brownian motion. 

What is new, then, is the alternation of probability and measure, probabil- 
ity motivating measure theory and measure theory generating further proba- 
bility. The book presupposes a knowledge of combinatorial and discrete 
probability, of rigorous calculus, in particular infinite series, and of elemen- 
tary set theory. Chapters 1 through 4 are designed to be taken up in 
sequence. Apart from starred sections and some examples, Chapters 5, 6, and 
7 are independent of one another; they can be read in any order. 

My goal has been to write a book I would myself have liked when I first 
took up the subject, and the needs of students have been given precedence 
over the requirements of logical economy. For instance, Kolmogorov’s existence
theorem appears not in the first chapter but in the last, stochastic
processes needed earlier having been constructed by special arguments 
which, although technically redundant, motivate the general result. And the 
general result is, in the last chapter, given two proofs at that. It is instructive, 
I think, to see the show in rehearsal as well as in performance. 


The Third Edition. The main changes in this edition are two. For the 
theory of Hausdorff measures in Section 19 I have substituted an account of 
L^p spaces, with applications to statistics. And for the queueing theory in
Section 24 I have substituted an introduction to ergodic theory, with applica- 
tions to continued fractions and Diophantine approximation. These sections 
now fit better with the rest of the book, and they illustrate again the 
connections probability theory has with applied mathematics on the one hand 
and with pure mathematics on the other. 

For suggestions that have led to improvements in the new edition, I thank 
Raj Bahadur, Walter Philipp, Michael Wichura, and Wing Wong, as well as 
the many readers who have sent their comments. 


Envoy. I said in the preface to the second edition that there would not be 
a third, and yet here it is. There will not be a fourth. It has been a very 
agreeable labor, writing these successive editions of my contribution to the 
river of mathematics. And although the contribution is small, the river is 
great: After ages of good service done to those who people its banks, as 
Joseph Conrad said of the Thames, it spreads out “in the tranquil dignity of a 
waterway leading to the uttermost ends of the earth.” 


PATRICK BILLINGSLEY 


Chicago, Illinois 
December 1994 
SECTION 1. BOREL’S NORMAL NUMBER THEOREM


Although sufficient for the development of many interesting topics in mathematical
probability, the theory of discrete probability spaces does not go far
enough for the rigorous treatment of problems of two kinds: those involving 
an infinitely repeated operation, as an infinite sequence of tosses of a coin, 
and those involving an infinitely fine operation, as the random drawing of a 
point from a segment. A mathematically complete development of probabil- 
ity, based on the theory of measure, puts these two classes of problem on the 
same footing, and as an introduction to measure-theoretic probability it is the 
purpose of the present section to show by example why this should be so. 


The Unit Interval 


The project is to construct simultaneously a model for the random drawing of 
a point from a segment and a model for an infinite sequence of tosses of a 
coin. The notions of independence and expected value, familiar in the 
discrete theory, will have analogues here, and some of the terminology of the 
discrete theory will be used in an informal way to motivate the development. 
The formal mathematics, however, which involves only such notions as the 
length of an interval and the Riemann integral of a step function, i 


If

(1.2)  A = \bigcup_{i=1}^{n} I_i = \bigcup_{i=1}^{n} (a_i, b_i],

where the intervals I_i = (a_i, b_i] are disjoint [A3]† and are contained in Ω,
assign to A the probability

(1.3)  P(A) = \sum_{i=1}^{n} |I_i| = \sum_{i=1}^{n} (b_i - a_i).


It is important to understand that in this section P(A) is defined only if A is
a finite disjoint union of subintervals of (0, 1], never for sets A of any other
kind.

If A and B are two such finite disjoint unions of intervals, and if A and B
are disjoint, then A ∪ B is a finite disjoint union of intervals and

(1.4)  P(A \cup B) = P(A) + P(B).
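
The definition (1.3) and the additivity (1.4) can be checked mechanically on small examples. The following is a minimal sketch, not part of the text's development: the function name `prob` and the sample sets `A` and `B` are illustrative assumptions, and exact rational arithmetic is used so that the comparison in (1.4) is an equality rather than an approximation.

```python
from fractions import Fraction

def prob(intervals):
    """P(A) as in (1.3) for A a finite disjoint union of intervals (a, b] in (0, 1].

    `intervals` is a list of pairs (a, b); the name and the input format are
    assumptions made for this illustration.
    """
    ivs = sorted((Fraction(a), Fraction(b)) for a, b in intervals)
    total = Fraction(0)
    prev_b = Fraction(0)
    for a, b in ivs:
        # Disjoint half-open intervals (a, b] may touch but not overlap.
        if not (prev_b <= a < b <= 1):
            raise ValueError("intervals must be disjoint subintervals of (0, 1]")
        total += b - a
        prev_b = b
    return total

# Two disjoint sets of the required form (hypothetical sample data).
A = [(Fraction(0), Fraction(1, 4)), (Fraction(1, 2), Fraction(3, 4))]
B = [(Fraction(1, 4), Fraction(3, 8))]

print(prob(A), prob(B), prob(A + B))
```

Here `A + B` is list concatenation, which represents the union because the two sets are disjoint.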


This relation, which is certainly obvious intuitively, is a consequence of the
additivity of the Riemann integral:

(1.5)  \int_0^1 (f(\omega) + g(\omega)) \, d\omega = \int_0^1 f(\omega) \, d\omega + \int_0^1 g(\omega) \, d\omega.


If f(ω) is a step function taking value c_j in the interval (x_{j-1}, x_j], where
0 = x_0 < x_1 < ... < x_k = 1, then its integral in the sense of Riemann has the value

(1.6)  \int_0^1 f(\omega) \, d\omega = \sum_{j=1}^{k} c_j (x_j - x_{j-1}).


If f = I_A and g = I_B are the indicators [A5] of A and B, then (1.4) follows from (1.5)
and (1.6), provided A and B are disjoint. This also shows that the definition (1.3) is
unambiguous; note that A will have many representations of the form (1.2) because
(a, b] ∪ (b, c] = (a, c]. Later these facts will be derived anew from the general theory
of Lebesgue integration.‡


According to the usual models, if a radioactive substance has emitted a
single α-particle during a unit interval of time, or if a single telephone call
has arrived at an exchange during a unit interval of time, then the instant at
which the emission or the arrival occurred is random in the sense that it lies
in (1.2) with probability (1.3). Thus (1.3) is the starting place for the
description of a point drawn at random from the unit interval: Ω is regarded
as a sample space, and the set (1.2) is identified with the event that the
random point lies in it.

†A notation [An] refers to paragraph n of the appendix beginning on p. 536; this is a collection
of mathematical definitions and facts required in the text.

‡Passages in small type concern side issues and technical matters, but their contents are
sometimes required later.

The definition (1.3) is also the starting point for a mathematical representation
of an infinite sequence of tosses of a coin. With each ω associate its
nonterminating dyadic expansion

(1.7)  \omega = \sum_{n=1}^{\infty} \frac{d_n(\omega)}{2^n} = .d_1(\omega) d_2(\omega) \ldots,

each d_n(ω) being 0 or 1 [A31]. Thus

(1.8)  (d_1(\omega), d_2(\omega), \ldots)

is the sequence of binary digits in the expansion of ω. For definiteness, a
point such as 1/2 = .1000... = .0111..., which has two expansions, takes the
nonterminating one; 1 takes the expansion .111....
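
The convention of always taking the nonterminating expansion can be made concrete. The following is a minimal sketch under the assumption that ω is supplied as an exact rational in (0, 1]; the function name `dyadic_digits` is hypothetical. The only point needing care is the comparison: a digit is 1 exactly when the doubled remainder strictly exceeds 1, which is what sends 1/2 to .0111... rather than .1000....

```python
from fractions import Fraction

def dyadic_digits(omega, n):
    """First n digits d_1, ..., d_n of the nonterminating dyadic expansion (1.7).

    omega must lie in (0, 1]. The remainder is kept in (0, 1] at every step,
    so the expansion never ends in 0's.
    """
    omega = Fraction(omega)
    if not (0 < omega <= 1):
        raise ValueError("omega must lie in (0, 1]")
    digits = []
    for _ in range(n):
        t = 2 * omega
        if t > 1:            # strictly greater: right-closed dyadic intervals
            digits.append(1)
            omega = t - 1
        else:
            digits.append(0)
            omega = t
    return digits

print(dyadic_digits(Fraction(1, 2), 6))
print(dyadic_digits(1, 4))
```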


[Figure: graphs of d_1(ω) and d_2(ω).]


Imagine now a coin with faces labeled 1 and 0 instead of the usual heads
and tails. If ω is drawn at random, then (1.8) behaves as if it resulted from an
infinite sequence of tosses of a coin. To see this, consider first the set of ω
for which d_i(ω) = u_i for i = 1, ..., n, where u_1, ..., u_n is a sequence of 0’s
and 1’s. Such an ω satisfies

\sum_{i=1}^{n} \frac{u_i}{2^i} \le \omega \le \sum_{i=1}^{n} \frac{u_i}{2^i} + \sum_{i=n+1}^{\infty} \frac{1}{2^i},

where the extreme values of ω correspond to the case d_i(ω) = 0 for i > n and
the case d_i(ω) = 1 for i > n. The second case can be achieved, but since the
binary expansions represented by the d_i(ω) are nonterminating (do not end
in 0’s) the first cannot, and ω must actually exceed Σ_{i=1}^{n} u_i/2^i. Thus

(1.9)  [\omega : d_i(\omega) = u_i, \ i = 1, \ldots, n] = \left( \sum_{i=1}^{n} \frac{u_i}{2^i}, \ \sum_{i=1}^{n} \frac{u_i}{2^i} + \frac{1}{2^n} \right].



The interval here is open on the left and closed on the right precisely
because the expansion (1.7) is the nonterminating one. In the model for coin
tossing the set (1.9) represents the event that the first n tosses give the
outcomes u_1, ..., u_n in sequence. By (1.3) and (1.9),

(1.10)  P[\omega : d_i(\omega) = u_i, \ i = 1, \ldots, n] = \frac{1}{2^n},


which is what probabilistic intuition requires. 
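
The correspondence between a finite sequence of outcomes and the interval (1.9) is easy to compute. The sketch below is illustrative only (the names `dyadic_interval` and `all_intervals` are assumptions): it returns the endpoints of the interval attached to u_1, ..., u_n and lists the 2^n intervals of a given rank in order, so that one can see them tile (0, 1] with common length 1/2^n as in (1.10).

```python
from fractions import Fraction
from itertools import product

def dyadic_interval(u):
    """Endpoints (left, right) of the interval (left, right] in (1.9) for digits u."""
    n = len(u)
    left = sum(Fraction(ui, 2**i) for i, ui in enumerate(u, start=1))
    return left, left + Fraction(1, 2**n)

def all_intervals(n):
    """The 2**n dyadic intervals of rank n, sorted by left endpoint."""
    return sorted(dyadic_interval(u) for u in product((0, 1), repeat=n))

left, right = dyadic_interval([1, 0])
print(left, right, right - left)
```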


[Figure: decompositions by dyadic intervals.]


The intervals (1.9) are called dyadic intervals, the endpoints being adjacent
dyadic rationals k/2^n and (k + 1)/2^n with the same denominator, and
n is the rank or order of the interval. For each n the 2^n dyadic intervals of
rank n decompose or partition the unit interval. In the passage from the
partition for n to that for n + 1, each interval (1.9) is split into two parts of
equal length, a left half on which d_{n+1}(ω) is 0 and a right half on which
d_{n+1}(ω) is 1. For u = 0 and for u = 1, the set [ω: d_{n+1}(ω) = u] is thus a
disjoint union of 2^n intervals of length 1/2^{n+1} and hence has probability 1/2:
P[ω: d_n(ω) = u] = 1/2 for all n.

Note that d_i(ω) is constant over each dyadic interval of rank i and that for
n > i each dyadic interval of rank n is entirely contained in a single dyadic
interval of rank i. Therefore, d_i(ω) is constant over each dyadic interval of
rank n if i ≤ n.

The probabilities of various familiar events can be written down immediately.
The sum Σ_{i=1}^{n} d_i(ω) is the number of 1’s among d_1(ω), ..., d_n(ω), to be
thought of as the number of heads in n tosses of a fair coin. The usual
binomial formula is

(1.11)  P\left[\omega : \sum_{i=1}^{n} d_i(\omega) = k\right] = \binom{n}{k} \frac{1}{2^n}, \qquad 0 \le k \le n.

This follows from the definitions: The set on the left in (1.11) is the union of
those intervals (1.9) corresponding to sequences u_1, ..., u_n containing k 1’s
and n − k 0’s; each such interval has length 1/2^n by (1.10) and there are \binom{n}{k}
of them, and so (1.11) follows from (1.3).
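
The counting argument behind (1.11) can be replayed by brute force for small n. This is a sketch under the assumption that enumerating all 2^n digit sequences is affordable; the function name `prob_k_ones` is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def prob_k_ones(n, k):
    """Probability, by (1.3), of the union of rank-n dyadic intervals with k ones."""
    # Each sequence u_1, ..., u_n indexes one dyadic interval of length 1/2**n.
    count = sum(1 for u in product((0, 1), repeat=n) if sum(u) == k)
    return Fraction(count, 2**n)

n = 5
for k in range(n + 1):
    print(k, prob_k_ones(n, k), Fraction(comb(n, k), 2**n))
```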
The functions d_n(ω) can be looked at in two ways. Fixing n and letting ω
vary gives a real function d_n = d_n(·) on the unit interval. Fixing ω and letting
n vary gives the sequence (1.8) of 0’s and 1’s. The probabilities (1.10) and
(1.11) involve only finitely many of the components d_i(ω). The interest here,
however, will center mainly on properties of the entire sequence (1.8). It will
be seen that the mathematical properties of this sequence mirror the properties
to be expected of a coin-tossing process that continues forever.

As the expansion (1.7) is the nonterminating one, there is the defect that 
for no w is (1.8) the sequence (1,0,0,0,...), for example. It seems clear that 
the chance should be 0 for the coin to turn up heads on the first toss and tails 
forever after, so that the absence of (1,0,0,0,...)—or of any other single 
sequence—should not matter. See on this point the additional remarks 
immediately preceding Theorem 1.2. 


The Weak Law of Large Numbers 


In studying the connection with coin tossing it is instructive to begin with a 
result that can, in fact, be treated within the framework of discrete probabil- 
ity, namely, the weak law of large numbers: 


Theorem 1.1. For each ε,†

(1.12)  \lim_{n \to \infty} P\left[\omega : \left| \frac{1}{n} \sum_{i=1}^{n} d_i(\omega) - \frac{1}{2} \right| \ge \varepsilon \right] = 0.


Interpreted probabilistically, (1.12) says that if n is large, then there is
small probability that the fraction or relative frequency of heads in n tosses
will deviate much from 1/2, an idea lying at the base of the frequency
conception of probability. As a statement about the structure of the real
numbers, (1.12) is also interesting arithmetically.

Since d_i(ω) is constant over each dyadic interval of rank n if i ≤ n, the
sum Σ_{i=1}^{n} d_i(ω) is also constant over each dyadic interval of rank n. The set in
(1.12) is therefore the union of certain of the intervals (1.9), and so its
probability is well defined by (1.3).

With the Riemann integral in the role of expected value, the usual
application of Chebyshev’s inequality will lead to a proof of (1.12). The
argument becomes simpler if the d_n(ω) are replaced by the Rademacher
functions,

(1.13)  r_n(\omega) = 2 d_n(\omega) - 1 = \begin{cases} +1 & \text{if } d_n(\omega) = 1, \\ -1 & \text{if } d_n(\omega) = 0. \end{cases}

†The standard ε and δ of analysis will always be understood to be positive.
[Figure: graphs of r_1(ω) and r_2(ω).]


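Consider the partial sums

(1.14)  s_n(\omega) = \sum_{i=1}^{n} r_i(\omega).

Since Σ_{i=1}^{n} d_i(ω) = (s_n(ω) + n)/2, (1.12) with ε/2 in place of ε is the same
thing as

(1.15)  \lim_{n \to \infty} P\left[\omega : \left| \frac{1}{n} s_n(\omega) \right| \ge \varepsilon \right] = 0.

This is the form in which the theorem will be proved.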

The Rademacher functions have themselves a direct probabilistic meaning.
If a coin is tossed successively, and if a particle starting from the origin
performs a random walk on the real line by successively moving one unit in
the positive or negative direction according as the coin falls heads or tails,
then r_i(ω) represents the distance it moves on the ith step and s_n(ω)
represents its position after n steps. There is also the gambling interpretation:
If a gambler bets one dollar, say, on each toss of the coin, r_i(ω)
represents his gain or loss on the ith play and s_n(ω) represents his gain or
loss in n plays.

Each dyadic interval of rank i − 1 splits into two dyadic intervals of rank i;
r_i(ω) has value −1 on one of these and value +1 on the other. Thus r_i(ω) is
−1 on a set of intervals of total length 1/2 and +1 on a set of total length 1/2.
Hence ∫_0^1 r_i(ω) dω = 0 by (1.6), and

(1.16)  \int_0^1 s_n(\omega) \, d\omega = 0

by (1.5). If the integral is viewed as an expected value, then (1.16) says that
the mean position after n steps of a random walk is 0.

Suppose that i < j. On a dyadic interval of rank j − 1, r_i(ω) is constant
and r_j(ω) has value −1 on the left half and +1 on the right. The product
r_i(ω) r_j(ω) therefore integrates to 0 over each of the dyadic intervals of rank
j − 1, and so

(1.17)  \int_0^1 r_i(\omega) r_j(\omega) \, d\omega = 0, \qquad i \ne j.

This corresponds to the fact that independent random variables are uncorrelated.
Since r_i^2(ω) = 1, expanding the square of the sum (1.14) shows that

(1.18)  \int_0^1 s_n^2(\omega) \, d\omega = n.

This corresponds to the fact that the variances of independent random
variables add. Of course (1.16), (1.17), and (1.18) stand on their own and in no
way depend on any probabilistic interpretation.
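
Because r_1, ..., r_n are constant on each dyadic interval of rank n, any function of them is a step function, and by (1.6) its Riemann integral is an average over the 2^n sign patterns. The following sketch uses that observation to check (1.16), (1.17), and (1.18) exactly for a small n. The helper name `integrate` and the choice n = 4 are assumptions made for illustration, and g is assumed to return an integer.

```python
from fractions import Fraction
from itertools import product

def integrate(g, n):
    """Integral over (0, 1] of the step function equal to g(r_1, ..., r_n).

    Each digit pattern u indexes one rank-n dyadic interval of length 1/2**n,
    on which r_i = 2*u_i - 1; the integral is computed as in (1.6).
    """
    total = Fraction(0)
    for u in product((0, 1), repeat=n):
        r = tuple(2 * d - 1 for d in u)
        total += Fraction(g(r), 2**n)
    return total

n = 4
print(integrate(lambda r: sum(r), n))        # (1.16)
print(integrate(lambda r: r[0] * r[2], n))   # (1.17) with i = 1, j = 3
print(integrate(lambda r: sum(r) ** 2, n))   # (1.18)
```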
Applying Chebyshev’s inequality in a formal way to the probability in
(1.15) now leads to

(1.19)  P[\omega : |s_n(\omega)| \ge n\varepsilon] \le \frac{1}{n^2 \varepsilon^2} \int_0^1 s_n^2(\omega) \, d\omega = \frac{1}{n \varepsilon^2}.

The following lemma justifies the inequality.


Let f be a step function as in (1.6): f(ω) = c_j for ω ∈ (x_{j-1}, x_j], where
0 = x_0 < x_1 < ... < x_k = 1.


Lemma. If f is a nonnegative step function, then [ω: f(ω) ≥ α] is for α > 0
a finite union of intervals and

(1.20)  P[\omega : f(\omega) \ge \alpha] \le \frac{1}{\alpha} \int_0^1 f(\omega) \, d\omega.


[Figure: the shaded region has area αP[ω: f(ω) ≥ α].]


PROOF. The set in question is the union of the intervals (x_{j-1}, x_j] for
which c_j ≥ α. If Σ' denotes summation over those j satisfying c_j ≥ α, then
P[ω: f(ω) ≥ α] = Σ'(x_j − x_{j-1}) by the definition (1.3). On the other hand,
since the c_j are all nonnegative by hypothesis, (1.6) gives

\int_0^1 f(\omega) \, d\omega = \sum_{j=1}^{k} c_j (x_j - x_{j-1}) \ge {\sum}' c_j (x_j - x_{j-1}) \ge {\sum}' \alpha (x_j - x_{j-1}).

Hence (1.20). ∎


Taking α = n^2 ε^2 and f(ω) = s_n^2(ω) in (1.20) gives (1.19). Clearly, (1.19)
implies (1.15), and as already observed, this in turn implies (1.12).
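
It may help to see how far the bound (1.19) is from the exact probability. The sketch below computes P[ω: |s_n(ω)| ≥ nε] exactly from (1.11), using s_n = 2k − n when k of the first n digits are 1, and compares it with 1/(nε^2). The function names and the sample values of n and ε are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def tail_prob(n, eps):
    """Exact P[|s_n| >= n*eps], summing (1.11) over the qualifying k."""
    eps = Fraction(eps)
    hits = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= n * eps)
    return Fraction(hits, 2**n)

def chebyshev_bound(n, eps):
    """Right side of (1.19)."""
    return 1 / (n * Fraction(eps) ** 2)

eps = Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(tail_prob(n, eps)), float(chebyshev_bound(n, eps)))
```

For small n the bound exceeds 1 and says nothing; it becomes informative only as n grows, which is all the weak law requires.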


The Strong Law of Large Numbers 


It is possible with a minimum of technical apparatus to prove a stronger 
result that cannot even be formulated in the discrete theory of probability. 


Consider the set

(1.21)  N = \left[\omega : \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} d_i(\omega) = \frac{1}{2}\right]

consisting of those ω for which the asymptotic relative frequency† of 1 in the
sequence (1.8) is 1/2. The points in (1.21) are called normal numbers. The idea
is to show that a real number ω drawn at random from the unit interval is
“practically certain” to be normal, or that there is “practical certainty” that 1
occurs in the sequence (1.8) of tosses with asymptotic relative frequency 1/2. It
is impossible at this stage to prove that P(N) = 1, because N is not a finite
union of intervals and so has been assigned no probability. But the notion of
“practical certainty” can be formalized in the following way.

Define a subset A of Ω to be negligible‡ if for each positive ε there exists
a finite or countable§ collection I_1, I_2, ... of intervals (they may overlap)
satisfying

(1.22)  A \subset \bigcup_k I_k

and

(1.23)  \sum_k |I_k| < \varepsilon.

A negligible set is one that can be covered by intervals the total sum of
whose lengths can be made arbitrarily small. If P(A) is assigned to such an
A in any reasonable way, then for the I_k of (1.22) and (1.23) it ought to be
true that P(A) ≤ Σ_k P(I_k) = Σ_k |I_k| < ε, and hence P(A) ought to be 0. Even
without any assignment of probability at all, the definition of negligibility can
serve as it stands as an explication of “practical impossibility” and “practical
certainty”: Regard it as practically impossible that the random ω will lie in A
if A is negligible, and regard it as practically certain that ω will lie in A if its
complement A^c [A1] is negligible.

†The frequency of 1 (the number of occurrences of it) among d_1(ω), ..., d_n(ω) is Σ_{i=1}^{n} d_i(ω), the
relative frequency is n^{-1} Σ_{i=1}^{n} d_i(ω), and the asymptotic relative frequency is the limit in (1.21).
‡The term negligible is introduced for the purposes of this section only. The negligible sets will
reappear later as the sets of Lebesgue measure 0.
§Countably infinite is unambiguous. Countable will mean finite or countably infinite, although it
will sometimes for emphasis be expanded as here to finite or countable.

Although the fact plays no role in the next proof, for an understanding of
negligibility observe first that a finite or countable union of negligible sets is
negligible. Indeed, suppose that A_1, A_2, ... are negligible. Given ε, for each
n choose intervals I_{n1}, I_{n2}, ... such that A_n ⊂ ⋃_k I_{nk} and Σ_k |I_{nk}| < ε/2^n. All
the intervals I_{nk} taken together form a countable collection covering ⋃_n A_n,
and their lengths add to Σ_n Σ_k |I_{nk}| < Σ_n ε/2^n = ε. Therefore, ⋃_n A_n is
negligible.

A set consisting of a single point is clearly negligible, and so every countable 
set is also negligible. The rationals for example form a negligible set. In the 
coin-tossing model, a single point of the unit interval has the role of a single 
sequence of 0’s and 1’s, or of a single sequence of heads and tails. It 
corresponds with intuition that it should be “practically impossible” to toss a 
coin infinitely often and realize any one particular infinite sequence set down 
in advance. It is for this reason not a real shortcoming of the model that for 
no w is (1.8) the sequence (1,0,0,0,...). In fact, since a countable set is 
negligible, it is not a shortcoming that (1.8) is never one of the countably 
many sequences that end in 0’s. 
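
The ε/2^n device used above for countable unions also gives an explicit covering of the rationals. The sketch below enumerates the rationals in (0, 1] in one particular order (by increasing denominator, an arbitrary choice) and covers the kth one by an interval of length ε/2^{k+1} having it as right endpoint, so that the total length over all k is ε/2 < ε. The function names are assumptions, and only the first m intervals are produced, since the full collection is infinite.

```python
from fractions import Fraction
from itertools import islice
from math import gcd

def rationals_in_unit_interval():
    """Enumerate the rationals in (0, 1] in lowest terms, by increasing denominator."""
    q = 1
    while True:
        for p in range(1, q + 1):
            if gcd(p, q) == 1:
                yield Fraction(p, q)
        q += 1

def cover(eps, m):
    """First m intervals (a, b] of a covering of the rationals with total length < eps."""
    eps = Fraction(eps)
    out = []
    for k, x in enumerate(islice(rationals_in_unit_interval(), m), start=1):
        # The kth rational x is the right endpoint of an interval of length eps / 2**(k+1).
        out.append((x - eps / 2**(k + 1), x))
    return out

intervals = cover(Fraction(1, 100), 50)
print(float(sum(b - a for a, b in intervals)))
```

Some of these intervals may extend slightly to the left of 0; this does no harm for the purpose of illustrating the length count.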


Theorem 1.2. The set of normal numbers has negligible complement. 


This is Borel’s normal number theorem,† a special case of the strong law of
large numbers. Like Theorem 1.1, it is of arithmetic as well as probabilistic
interest.

The set N^c is not countable: Consider a point ω for which
(d_1(ω), d_2(ω), ...) = (1, 1, u_1, 1, 1, u_2, ...), that is, a point for which d_i(ω) = 1
unless i is a multiple of 3. Since n^{-1} Σ_{i=1}^{n} d_i(ω) ≥ 2/3, such a point cannot be
normal. But there are uncountably many such points, one for each infinite
sequence (u_1, u_2, ...) of 0’s and 1’s. Thus one cannot prove N^c negligible by
proving it countable, and a deeper argument is required.


PROOF OF THEOREM 1.2. Clearly (1.21) and

(1.24)  N = \left[\omega : \lim_{n \to \infty} \frac{1}{n} s_n(\omega) = 0\right]

define the same set (see (1.14)). To prove N^c negligible requires constructing
coverings that satisfy (1.22) and (1.23) for A = N^c. The construction makes
use of the inequality

(1.25)  P[\omega : |s_n(\omega)| \ge n\varepsilon] \le \frac{1}{n^4 \varepsilon^4} \int_0^1 s_n^4(\omega) \, d\omega.

This follows by the same argument that leads to the inequality in (1.19); it is
only necessary to take f(ω) = s_n^4(ω) and α = n^4 ε^4 in (1.20). As the integral in
(1.25) will be shown to have order n^2, the inequality is stronger than (1.19).

†Émile Borel: Sur les probabilités dénombrables et leurs applications arithmétiques, Circ. Mat.
d. Palermo, 29 (1909), 247-271. See DUDLEY for excellent historical notes on analysis and
probability.


The integrand on the right in (1.25) is

(1.26)  s_n^4(\omega) = \sum r_\alpha(\omega) r_\beta(\omega) r_\gamma(\omega) r_\delta(\omega),

where the four indices range independently from 1 to n. Depending on how
the indices match up, each term in this sum reduces to one of the following
five forms, where in each case the indices are now distinct:

(1.27)
    r_i^4(\omega) = 1,
    r_i^2(\omega) r_j^2(\omega) = 1,
    r_i^2(\omega) r_j(\omega) r_k(\omega) = r_j(\omega) r_k(\omega),
    r_i^3(\omega) r_j(\omega) = r_i(\omega) r_j(\omega),
    r_i(\omega) r_j(\omega) r_k(\omega) r_l(\omega).


If, for example, k exceeds i, j, and l, then the last product in (1.27)
integrates to 0 over each dyadic interval of rank k − 1, because r_i(ω) r_j(ω) r_l(ω)
is constant there, while r_k(ω) is −1 on the left half and +1 on the right.
Adding over the dyadic intervals of rank k − 1 gives

\int_0^1 r_i(\omega) r_j(\omega) r_k(\omega) r_l(\omega) \, d\omega = 0.


This holds whenever the four indices are distinct. From this and (1.17) it 
follows that the last three forms in (1.27) integrate to 0 over the unit interval; 
of course, the first two forms integrate to 1. 

The number of occurrences in the sum (1.26) of the first form in (1.27) is
n. The number of occurrences of the second form is 3n(n − 1), because there
are n choices for the α in (1.26), three ways to match it with β, γ, or δ, and
n − 1 choices for the value common to the remaining two indices. A term-by-term
integration of (1.26) therefore gives

(1.28)  \int_0^1 s_n^4(\omega) \, d\omega = n + 3n(n - 1) \le 3n^2,
and it follows by (1.25) that

(1.29)  P\left[\omega : \left| \frac{1}{n} s_n(\omega) \right| \ge \varepsilon \right] \le \frac{3}{n^2 \varepsilon^4}.
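
The count leading to (1.28) can be confirmed by brute force for small n. As with the second moment, the integral of s_n^4 is the average of (r_1 + ... + r_n)^4 over the 2^n sign patterns, each corresponding to a dyadic interval of rank n. The sketch below is illustrative only; the function name `fourth_moment` and the range of n are assumptions.

```python
from fractions import Fraction
from itertools import product

def fourth_moment(n):
    """Integral over (0, 1] of s_n**4, computed by (1.6) over rank-n dyadic intervals."""
    total = sum(sum(2 * d - 1 for d in u) ** 4 for u in product((0, 1), repeat=n))
    return Fraction(total, 2**n)

for n in range(1, 9):
    # Columns: n, the exact integral, the formula in (1.28), and the bound 3n^2.
    print(n, fourth_moment(n), n + 3 * n * (n - 1), 3 * n * n)
```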


Fix a positive sequence {ε_n} going to 0 slowly enough that the series
Σ_n ε_n^{-4} n^{-2} converges (take ε_n = n^{-1/8}, for example). If A_n = [ω: |n^{-1} s_n(ω)| ≥
ε_n], then P(A_n) ≤ 3 ε_n^{-4} n^{-2} by (1.29), and so Σ_n P(A_n) < ∞.

If, for some m, ω lies in A_n^c for all n greater than or equal to m, then
|n^{-1} s_n(ω)| < ε_n for n ≥ m, and it follows that ω is normal because ε_n → 0 (see
(1.24)). In other words, for each m, ⋂_{n=m}^{∞} A_n^c ⊂ N, which is the same thing as
N^c ⊂ ⋃_{n=m}^{∞} A_n. This last relation leads to the required covering: Given ε,
choose m so that Σ_{n=m}^{∞} P(A_n) < ε. Now A_n is a finite disjoint union ⋃_k I_{nk} of
intervals with Σ_k |I_{nk}| = P(A_n), and therefore ⋃_{n=m}^{∞} A_n is a countable union
⋃_{n=m}^{∞} ⋃_k I_{nk} of intervals (not disjoint, but that does not matter) with
Σ_{n=m}^{∞} Σ_k |I_{nk}| = Σ_{n=m}^{∞} P(A_n) < ε. The intervals I_{nk} (n ≥ m, k ≥ 1) provide a
covering of N^c of the kind the definition of negligibility calls for. ∎
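
The choice of m in the proof can be made explicit for the sample sequence ε_n = n^{-1/8}, for which the bound on P(A_n) becomes 3n^{-3/2}. The sketch below is one way to do it, not the text's: it uses the integral comparison Σ_{n≥m} n^{-3/2} ≤ ∫_{m-1}^{∞} x^{-3/2} dx = 2/√(m − 1), an assumption introduced here, so that the tail is at most 6/√(m − 1), and then takes m large enough that this is below ε. The function names are hypothetical.

```python
import math
from fractions import Fraction

def choose_m(eps):
    """An m with 6 / sqrt(m - 1) < eps, hence sum_{n >= m} 3 * n**-1.5 < eps.

    Exact rational arithmetic avoids rounding trouble at the boundary.
    """
    eps = Fraction(eps)
    return math.floor(36 / eps**2) + 2

def partial_tail(m, terms):
    """Partial sum of the bounds 3 * n**-1.5 for n = m, ..., m + terms - 1."""
    return sum(3 * n**-1.5 for n in range(m, m + terms))

m = choose_m(Fraction(1, 10))
print(m, partial_tail(m, 10000))
```

The resulting m is far from optimal; the proof needs only that some m works.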


Strong Law Versus Weak 


Theorem 1.2 is stronger than Theorem 1.1. A consideration of the forms of the two 
propositions will show that the strong law goes far beyond the weak law. 

For each n let f_n(ω) be a step function on the unit interval, and consider the
relation

(1.30)  \lim_{n \to \infty} P[\omega : |f_n(\omega)| \ge \varepsilon] = 0

together with the set

(1.31)  \left[\omega : \lim_{n \to \infty} f_n(\omega) = 0\right].

If f_n(ω) = n^{-1} s_n(ω), then (1.30) reduces to the weak law (1.15), and (1.31) coincides
with the set (1.24) of normal numbers. According to a general result proved below
(Theorem 5.2(ii)), whatever the step functions f_n(ω) may be, if the set (1.31) has
negligible complement, then (1.30) holds for each positive ε. For this reason, a proof
of Theorem 1.2 is automatically a proof of Theorem 1.1.

The converse, however, fails: There exist step functions f_n(ω) that satisfy (1.30) for
each positive ε but for which (1.31) fails to have negligible complement (Example 5.4).
For this reason, a proof of Theorem 1.1 is not automatically a proof of Theorem 1.2;
the latter lies deeper and its proof is correspondingly more complex.
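
One standard way to see that the converse fails is a sequence of indicators of dyadic intervals that sweep repeatedly across (0, 1]. The sketch below is offered as an illustration and is not claimed to be the construction of Example 5.4: writing n = 2^j + k with 0 ≤ k < 2^j, let f_n be the indicator of (k/2^j, (k + 1)/2^j]. Then P[ω: |f_n(ω)| ≥ ε] = 2^{-j} → 0 for 0 < ε ≤ 1, yet every ω has f_n(ω) = 1 for exactly one n in each block 2^j ≤ n < 2^{j+1}, so f_n(ω) converges for no ω. The function names are assumptions.

```python
from fractions import Fraction

def block(n):
    """Write n = 2**j + k with 0 <= k < 2**j and return (j, k); requires n >= 1."""
    j = n.bit_length() - 1
    return j, n - 2**j

def f(n, omega):
    """Indicator of the dyadic interval (k/2**j, (k+1)/2**j] attached to n."""
    j, k = block(n)
    return 1 if Fraction(k, 2**j) < omega <= Fraction(k + 1, 2**j) else 0

def prob_nonzero(n):
    """P[f_n >= eps] for 0 < eps <= 1: the length of the interval attached to n."""
    j, _ = block(n)
    return Fraction(1, 2**j)

omega = Fraction(1, 3)
print([prob_nonzero(n) for n in (1, 2, 4, 8, 1024)])
print([n for n in range(1, 64) if f(n, omega) == 1])
```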


Length 


According to Theorem 1.2, the complement N^c of the set of normal numbers
is negligible. What if N itself were negligible? It would then follow that
(0, 1] = N ∪ N^c was negligible as well, which would disqualify negligibility as
an explication of “practical impossibility,” as a stand-in for “probability
zero.” The proof below of the “obvious” fact that an interval of positive
length is not negligible (Theorem 1.3(ii)), while simple enough, does involve
the most fundamental properties of the real number system.

Consider an interval I = (a, b] of length |I| = b − a; see (1.1). Consider
also a finite or infinite sequence of intervals I_k = (a_k, b_k]. While each of
these intervals is bounded, they need not be subintervals of (0, 1].


Theorem 1.3. (i) If ⋃_k I_k ⊂ I, and the I_k are disjoint, then Σ_k |I_k| ≤ |I|.
(ii) If I ⊂ ⋃_k I_k (the I_k need not be disjoint), then |I| ≤ Σ_k |I_k|.
(iii) If I = ⋃_k I_k, and the I_k are disjoint, then |I| = Σ_k |I_k|.


PROOF. Of course (iii) follows from (i) and (ii).


PROOF OF (i): Finite case. Suppose there are n intervals. The result
being obvious for n = 1, assume that it holds for n − 1. If a_n is the largest
among a_1, ..., a_n (this is just a matter of notation), then ⋃_{k=1}^{n-1} (a_k, b_k] ⊂
(a, a_n], so that Σ_{k=1}^{n-1} (b_k − a_k) ≤ a_n − a by the induction hypothesis, and
hence Σ_{k=1}^{n} (b_k − a_k) ≤ (a_n − a) + (b_n − a_n) ≤ b − a.

Infinite case. If there are infinitely many intervals, each finite subcollection
satisfies the hypotheses of (i), and so Σ_{k=1}^{n} (b_k − a_k) ≤ b − a by the finite case.
But as n is arbitrary, the result follows.


PROOF OF (ii): Finite case. Assume that the result holds for the case of
n − 1 intervals and that (a, b] ⊂ ⋃_{k=1}^{n} (a_k, b_k]. Suppose that a_n < b ≤ b_n
(notation again). If a_n ≤ a, the result is obvious. Otherwise, (a, a_n] ⊂
⋃_{k=1}^{n-1} (a_k, b_k], so that Σ_{k=1}^{n-1} (b_k − a_k) ≥ a_n − a by the induction hypothesis
and hence Σ_{k=1}^{n} (b_k − a_k) ≥ (a_n − a) + (b_n − a_n) ≥ b − a. The finite case thus
follows by induction.

Infinite case. Suppose that (a, b] ⊂ ⋃_{k=1}^{∞} (a_k, b_k]. If 0 < ε < b − a, the open
intervals (a_k, b_k + ε2^{-k}) cover the closed interval [a + ε, b], and it follows by
the Heine-Borel theorem [A13] that [a + ε, b] ⊂ ⋃_{k=1}^{n} (a_k, b_k + ε2^{-k}) for
some n. But then (a + ε, b] ⊂ ⋃_{k=1}^{n} (a_k, b_k + ε2^{-k}], and by the finite case,
b − (a + ε) ≤ Σ_{k=1}^{n} (b_k + ε2^{-k} − a_k) ≤ Σ_{k=1}^{∞} (b_k − a_k) + ε. Since ε was
arbitrary, the result follows. ∎


Theorem 1.3 will be the starting point for the theory of Lebesgue measure 
as developed in Sections 2 and 3. Taken together, parts (i) and (ii) of the 
theorem for only finitely many intervals I, imply (1.4) for disjoint A and B. 
Like (1.4), they follow immediately from the additivity of the Riemann 
integral; but the point is to give an independent development of which the 
Riemann theory will be an eventual by-product. 

To pass from the finite to the infinite case in part (i) of the theorem is
easy. But to pass from the finite to the infinite case in part (ii) involves
compactness, a profound idea underlying all of modern analysis. And it is
part (ii) that shows that an interval I of positive length is not negligible: |I| is
a positive lower bound for the sum of the lengths of the intervals in any
covering of I.


The Measure Theory of Diophantine Approximation* 


Diophantine approximation has to do with the approximation of real numbers x by 
rational fractions p/q. The measure theory of Diophantine approximation has to do 
with the degree of approximation that is possible if one disregards negligible sets of 
real x. 

For each positive integer q, x must lie between some pair of successive multiples
of 1/q, so that for some p, |x − p/q| < 1/q. Since for each q the intervals

(1.32)  \left( \frac{p}{q} - \frac{1}{2q}, \ \frac{p}{q} + \frac{1}{2q} \right]

decompose the line, the error of approximation can be further reduced to 1/2q: For
each q there is a p such that |x − p/q| ≤ 1/2q. These observations are of course
trivial. But for “most” real numbers x there will be many values of p and q for which
x lies very near the center of the interval (1.32), so that p/q is a very sharp
approximation to x.


Theorem 1.4. If x is irrational, there are infinitely many irreducible fractions p/q
such that

(1.33)  \left| x - \frac{p}{q} \right| < \frac{1}{q^2}.


This famous theorem of Dirichlet says that for infinitely many p and q, x lies in
(p/q − 1/q^2, p/q + 1/q^2) and hence is indeed very near the center of (1.32).


PROOF. For a positive integer Q, decompose [0, 1) into the Q subintervals
[(i − 1)/Q, i/Q), i = 1, ..., Q. The points (fractional parts) {qx} = qx − ⌊qx⌋ for q =
0, 1, ..., Q lie in [0, 1), and since there are Q + 1 points† and only Q subintervals, it
follows (Dirichlet’s drawer principle) that some subinterval contains more than one
point. Suppose that {q'x} and {q''x} lie in the same subinterval and 0 ≤ q' < q'' ≤ Q.
Take q = q'' − q' and p = ⌊q''x⌋ − ⌊q'x⌋; then 1 ≤ q ≤ Q and |qx − p| = |{q''x} − {q'x}|
< 1/Q:

(1.34)  \left| x - \frac{p}{q} \right| < \frac{1}{qQ} \le \frac{1}{q^2}.

If p and q have any common factors, cancel them; this will not change the left side of
(1.34), and it will decrease q.

For each Q, therefore, there is an irreducible p/q satisfying (1.34).‡ Suppose
there are only finitely many irreducible solutions of (1.33), say p_1/q_1, ..., p_m/q_m.
Since x is irrational, the |x − p_i/q_i| are all positive, and it is possible to choose Q so
that Q^{-1} is smaller than each of them. But then the p/q of (1.34) is a solution of
(1.33), and since |x − p/q| < 1/Q, there is a contradiction. ∎


*This topic may be omitted.

†Although the fact is not technically necessary to the proof, these points are distinct: {q'x} = {q''x}
implies (q'' − q')x = ⌊q''x⌋ − ⌊q'x⌋, which in turn implies that x is rational unless q' = q''.

‡This much of the proof goes through even if x is rational.
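
The drawer-principle step of the proof is constructive and can be carried out directly. The following is a minimal sketch, assuming x is supplied as an exact rational (for instance a decimal truncation of an irrational number); this is legitimate for this part of the argument, which goes through even if x is rational. The function name `dirichlet` and the sample value of x are assumptions made for illustration.

```python
from fractions import Fraction
from math import floor

def dirichlet(x, Q):
    """Return an irreducible p/q with 1 <= q <= Q and |x - p/q| < 1/(q*Q), as in (1.34)."""
    x = Fraction(x)
    seen = {}  # subinterval index -> the q' whose fractional part {q'x} landed there
    for q2 in range(Q + 1):
        frac = q2 * x - floor(q2 * x)    # fractional part {q2 * x}, in [0, 1)
        box = floor(frac * Q)            # index i of the subinterval [i/Q, (i+1)/Q)
        if box in seen:
            q1 = seen[box]
            q = q2 - q1
            p = floor(q2 * x) - floor(q1 * x)
            return Fraction(p, q)        # Fraction cancels common factors
        seen[box] = q2
    raise AssertionError("unreachable: Q + 1 points in Q subintervals")

x = Fraction(141421356237, 10**11)       # a rational truncation of sqrt(2)
r = dirichlet(x, 100)
print(r, float(abs(x - r)))
```

Cancelling common factors only decreases q, so the inequality against 1/(qQ) continues to hold for the reduced denominator.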
In the measure theory of Diophantine approximation, one looks at the set of real x
having such and such approximation properties and tries to show that this set is
negligible or else that its complement is. Since the set of rationals is negligible,
Theorem 1.4 implies such a result: Apart from a negligible set of x, (1.33) has
infinitely many irreducible solutions.

What happens if the inequality (1.33) is tightened? Consider

(1.35)  \left| x - \frac{p}{q} \right| < \frac{1}{q^2 \varphi(q)},

and let A_φ consist of the real x for which (1.35) has infinitely many irreducible
solutions. Under what conditions on φ will A_φ have negligible complement? If
φ(q) < 1, then (1.35) is weaker than (1.33): φ(q) ≥ 1 in the interesting cases. Since x
satisfies (1.35) for infinitely many irreducible p/q if and only if x − ⌊x⌋ does, A_φ may
as well be redefined as the set of x in (0, 1) (or even as the set of irrational x in (0, 1))
for which (1.35) has infinitely many solutions.


Theorem 1.5. Suppose that φ is positive and nondecreasing. If

(1.36)  \sum_q \frac{1}{q \varphi(q)} = \infty,

then A_φ has negligible complement.


Theorem 1.4 covers the case φ(q) = 1. Although this is the natural place to state
Theorem 1.5 in its general form, the proof, which involves continued fractions and the
ergodic theorem, must be postponed; see Section 24, p. 324. The converse, on the
other hand, has a very simple proof.


Theorem 1.6. Suppose that ¢ is positive. If 
(1.37) Ds ae 00, 
7 4¢(4) 
then A , is negligible. 


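Proof. Given $\epsilon$, choose $q_0$ so that $\sum_{q \ge q_0} 1/(q\varphi(q)) < \epsilon/4$. If $x \in A_\varphi$, then (1.35)
holds for some $q \ge q_0$, and since $0 < x < 1$, the corresponding $p$ lies in the range
$0 \le p \le q$. Therefore,

$$A_\varphi \subset \bigcup_{q=q_0}^{\infty} \bigcup_{p=0}^{q} \left( \frac{p}{q} - \frac{1}{q^2\varphi(q)},\ \frac{p}{q} + \frac{1}{q^2\varphi(q)} \right).$$

The right side here is a countable union of intervals covering $A_\varphi$, and the sum of
their lengths is

$$\sum_{q \ge q_0} \sum_{p=0}^{q} \frac{2}{q^2\varphi(q)} = \sum_{q \ge q_0} \frac{2(q+1)}{q^2\varphi(q)} \le \sum_{q \ge q_0} \frac{4}{q\,\varphi(q)} < \epsilon.$$

Thus $A_\varphi$ satisfies the definition ((1.22) and (1.23)) of negligibility. ∎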
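The length estimate in the proof can be checked numerically. The following is only a sketch: the function names, the choice $\varphi(q) = q^{1/2}$, and the truncation of the infinite sums at `q_max` are assumptions made for illustration, not part of the text.

```python
def cover_length(phi, q0, q_max):
    """Sum of the lengths 2/(q^2 phi(q)) of the q+1 covering intervals
    around p/q, 0 <= p <= q, for q0 <= q <= q_max (a truncated sum)."""
    return sum(2 * (q + 1) / (q * q * phi(q)) for q in range(q0, q_max + 1))


def tail_bound(phi, q0, q_max):
    """Truncated version of the bound: sum of 4/(q phi(q)) over q0 <= q <= q_max."""
    return sum(4 / (q * phi(q)) for q in range(q0, q_max + 1))


def phi(q):
    # Illustrative choice phi(q) = q**0.5, for which the series (1.37) converges.
    return q ** 0.5


for q0 in (100, 10_000):
    print(q0, cover_length(phi, q0, 20_000), tail_bound(phi, q0, 20_000))
```

Raising `q0` shrinks both quantities, which is the mechanism that makes the covering length smaller than any prescribed $\epsilon$.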




If $\varphi_1(q) \equiv 1$, then (1.36) holds and hence $A_{\varphi_1}$ has negligible complement (as
follows also from Theorem 1.4). If $\varphi_2(q) = q^{\epsilon}$, however, then (1.37) holds and
$A_{\varphi_2}$ itself is negligible. Outside the negligible set $A_{\varphi_1}^c \cup A_{\varphi_2}$, therefore, $|x - p/q| < 1/q^2$
has infinitely many irreducible solutions but $|x - p/q| < 1/q^{2+\epsilon}$ has only
finitely many. Similarly, since $\sum_q 1/(q \log q)$ diverges but $\sum_q 1/(q \log^{1+\epsilon} q)$ converges,
outside a negligible set $|x - p/q| < 1/(q^2 \log q)$ has infinitely many irreducible
solutions but $|x - p/q| < 1/(q^2 \log^{1+\epsilon} q)$ has only finitely many.

Rational approximations to x obtained by truncating its binary (or decimal) 
expansion are very inaccurate: see Example 4.17. The sharp rational approximations 
to x come from truncation of its continued-fraction expansion: see Section 24. 
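The contrast between the two kinds of truncation can be seen in a small experiment. This is a sketch under stated assumptions: the helper names are hypothetical, $x = \sqrt{2} - 1$ is an arbitrary test point, and floating-point arithmetic limits the computation to the first few convergents.

```python
import math
from fractions import Fraction


def convergents(x, count):
    """First `count` continued-fraction convergents of a float x in (0, 1).
    Assumes the fractional parts never vanish (x behaves like an irrational)."""
    result = []
    p_prev, q_prev = 1, 0   # p_{-1}, q_{-1}
    p, q = 0, 1             # p_0 / q_0 = 0, the integer part of x
    frac = x
    for _ in range(count):
        a = math.floor(1 / frac)      # next partial quotient
        frac = 1 / frac - a
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        result.append(Fraction(p, q))
    return result


def binary_truncation(x, n):
    """The rational obtained by keeping n binary digits of x (in lowest terms)."""
    return Fraction(math.floor(x * 2 ** n), 2 ** n)


def scaled_error(x, r):
    """q^2 |x - p/q| for r = p/q in lowest terms; (1.33) holds when this is < 1."""
    return abs(x - r) * r.denominator ** 2


x = math.sqrt(2) - 1
for r in convergents(x, 8):
    print(r, scaled_error(x, r))
print(binary_truncation(x, 10), scaled_error(x, binary_truncation(x, 10)))
```

Each convergent keeps $q^2 |x - p/q|$ below 1, whereas the 10-digit binary truncation does not.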


PROBLEMS 


Some problems involve concepts not required for an understanding of the text, or 
concepts treated only in later sections; there are no problems whose solutions are 
used in the text itself. An arrow ↑ points back to a problem (the one immediately
preceding if no number is given) the solution and terminology of which are assumed. 
See Notes on the Problems, p. 552. 


1.1. (a) Show that a discrete probability space (see Example 2.8 for the formal
definition) cannot contain an infinite sequence $A_1, A_2, \ldots$ of independent
events each of probability $\tfrac{1}{2}$. Since $A_n$ could be identified with heads on the $n$th
toss of a coin, the existence of such a sequence would make this section
superfluous.

(b) Suppose that $0 \le p_n \le 1$, and put $a_n = \min\{p_n, 1 - p_n\}$. Show that, if $\sum_n a_n$
diverges, then no discrete probability space can contain independent events
$A_1, A_2, \ldots$ such that $A_n$ has probability $p_n$.


1.2. Show that $N$ and $N^c$ are dense [A15] in $(0, 1]$.


1.3. ↑ Define a set $A$ to be trifling† if for each $\epsilon$ there exists a finite sequence of
intervals $I_k$ satisfying (1.22) and (1.23). This definition and the definition of
negligibility apply as they stand to all sets on the real line, not just to subsets of
$(0, 1]$.


(a) Show that a trifling set is negligible. 

(b) Show that the closure of a trifling set is also trifling. 

(c) Find a bounded negligible set that is not trifling. 

(d) Show that the closure of a negligible set may not be negligible. 


(e) Show that finite unions of trifling sets are trifling but that this can fail for 
countable unions. 


1.4. ↑ For $i = 0, \ldots, r - 1$, let $A_r(i)$ be the set of numbers in $(0, 1]$ whose nonterminating
expansions in the base $r$ do not contain the digit $i$.


(a) Show that $A_r(i)$ is trifling.


(b) Find a trifling set $A$ such that every point in the unit interval can be
represented in the form $x + y$ with $x$ and $y$ in $A$.

(c) Let $A_r(i_1, \ldots, i_k)$ consist of the numbers in the unit interval in whose base-$r$
expansions the digits $i_1, \ldots, i_k$ nowhere appear consecutively in that order.
Show that it is trifling. What does this imply about the monkey that types at
random?


†Like negligible, trifling is a nonce word used only here. The trifling sets are exactly the sets of
content 0: See Problem 3.15.


1.5. ↑ The Cantor set $C$ can be defined as the closure of $A_3(1)$.

(a) Show that $C$ is uncountable but trifling.

(b) From $[0, 1]$ remove the open middle third $(\tfrac13, \tfrac23)$; from the remainder, a
union of two closed intervals, remove the two open middle thirds $(\tfrac19, \tfrac29)$ and
$(\tfrac79, \tfrac89)$. Show that $C$ is what remains when this process is continued ad infinitum.

(c) Show that $C$ is perfect [A15].


1.6. Put $M(t) = \int_0^1 e^{t s_n(\omega)}\, d\omega$, and show by successive differentiations under the
integral that

$$M^{(k)}(0) = \int_0^1 s_n^k(\omega)\, d\omega. \tag{1.38}$$

Over each dyadic interval of rank $n$, $s_n(\omega)$ has a constant value of the form
$\pm 1 \pm 1 \pm \cdots \pm 1$, and therefore $M(t) = 2^{-n} \sum \exp t(\pm 1 \pm 1 \pm \cdots \pm 1)$, where
the sum extends over all $2^n$ $n$-long sequences of $+1$'s and $-1$'s. Thus

$$M(t) = \left( \frac{e^t + e^{-t}}{2} \right)^n = (\cosh t)^n. \tag{1.39}$$

Use this and (1.38) to give new proofs of (1.16), (1.18), and (1.28). (This, the
method of moment generating functions, will be investigated systematically in
Section 9.)


1.7. ↑ By an argument similar to that leading to (1.39) show that the Rademacher
functions satisfy

$$\int_0^1 \exp\left[ i \sum_{k=1}^{n} a_k r_k(\omega) \right] d\omega = \prod_{k=1}^{n} \frac{e^{i a_k} + e^{-i a_k}}{2} = \prod_{k=1}^{n} \cos a_k.$$

Take $a_k = t 2^{-k}$, and from $\sum_{k=1}^{\infty} r_k(\omega) 2^{-k} = 2\omega - 1$ deduce

$$\frac{\sin t}{t} = \prod_{k=1}^{\infty} \cos \frac{t}{2^k} \tag{1.40}$$

by letting $n \to \infty$ inside the integral above. Derive Vieta's formula

$$\frac{2}{\pi} = \frac{\sqrt{2}}{2} \cdot \frac{\sqrt{2 + \sqrt{2}}}{2} \cdot \frac{\sqrt{2 + \sqrt{2 + \sqrt{2}}}}{2} \cdots.$$


1.8. A number $\omega$ is normal in the base 2 if and only if for each positive $\epsilon$ there exists
an $n_0(\epsilon, \omega)$ such that $|n^{-1} \sum_{i=1}^{n} d_i(\omega) - \tfrac12| < \epsilon$ for all $n$ exceeding $n_0(\epsilon, \omega)$.
Theorem 1.2 concerns the entire dyadic expansion, whereas Theorem 1.1
concerns only the beginning segment. Point up the difference by showing that
for $\epsilon < \tfrac12$ the $n_0(\epsilon, \omega)$ above cannot be the same for all $\omega$ in $N$; in other words,
$n^{-1} \sum_{i=1}^{n} d_i(\omega)$ converges to $\tfrac12$ for all $\omega$ in $N$, but not uniformly. But see
Problem 13.9.


1.9. 1.3 ↑ (a) Using the finite form of Theorem 1.3(ii), together with Problem
1.3(b), show that a trifling set is nowhere dense [A15].


(b) Put $B = \bigcup_n (r_n - 2^{-n-2}, r_n + 2^{-n-2}]$, where $r_1, r_2, \ldots$ is an enumeration of
the rationals in $(0, 1]$. Show that $(0, 1] - B$ is nowhere dense but not trifling or
even negligible.


(c) Show that a compact negligible set is trifling. 


1.10. ↑ A set of the first category [A15] can be represented as a countable union of
nowhere dense sets; this is a topological notion of smallness, just as negligibility
is a metric notion of smallness. Neither condition implies the other:


(a) Show that the nonnegligible set $N$ of normal numbers is of the first category
by proving that $A_m = \bigcap_{n=m}^{\infty} [\omega: |n^{-1} s_n(\omega)| < \tfrac12]$ is nowhere dense and
$N \subset \bigcup_m A_m$.


(b) According to a famous theorem of Baire, a nonempty interval is not of the
first category. Use this fact to prove that the negligible set $N^c = (0, 1] - N$ is not
of the first category.


1.11. Prove: 


(a) If x is rational, (1.33) has only finitely many irreducible solutions. 


(b) Suppose that $\varphi(q) \ge 1$ and (1.35) holds for infinitely many pairs $p, q$ but
only for finitely many relatively prime ones. Then $x$ is rational.


(c) If $\varphi$ goes to infinity too rapidly, then $A_\varphi$ is negligible (Theorem 1.6). But
however rapidly $\varphi$ goes to infinity, $A_\varphi$ is nonempty, even uncountable. Hint:
Consider $x = \sum_{k=1}^{\infty} 1/2^{a(k)}$ for integral $a(k)$ increasing very rapidly to infinity.
Spaces


Let $\Omega$ be an arbitrary space or set of points $\omega$. In probability theory $\Omega$
consists of all the possible results or outcomes $\omega$ of an experiment or
observation. For observing the number of heads in $n$ tosses of a coin the
space $\Omega$ is $\{0, 1, \ldots, n\}$; for describing the complete history of the $n$ tosses $\Omega$
is the space of all $2^n$ $n$-long sequences of H's and T's; for an infinite
sequence of tosses $\Omega$ can be taken as the unit interval as in the preceding
section; for the number of $\alpha$-particles emitted by a substance during a unit
interval of time or for the number of telephone calls arriving at an exchange
$\Omega$ is $\{0, 1, 2, \ldots\}$; for the position of a particle $\Omega$ is three-dimensional
Euclidean space; for describing the motion of the particle $\Omega$ is an appropriate
space of functions; and so on. Most $\Omega$'s to be considered are interesting
from the point of view of geometry and analysis as well as that of probability.

Viewed probabilistically, a subset of $\Omega$ is an event and an element $\omega$ of $\Omega$
is a sample point.


Assigning Probabilities 

In setting up a space $\Omega$ as a probabilistic model, it is natural to try to assign
probabilities to as many events as possible. Consider again the case $\Omega = (0, 1]$,
the unit interval. It is natural to try to go beyond the definition (1.3) and
assign probabilities in a systematic way to sets other than finite unions of
intervals. Since the set of nonnormal numbers is negligible, for example, one
feels it ought to have probability 0. For another probabilistically interesting
set that is not a finite union of intervals, consider

$$\bigcup_{n=1}^{\infty} [\omega: -a < s_1(\omega), \ldots, s_{n-1}(\omega) < b,\ s_n(\omega) = -a], \tag{2.1}$$


where a and b are positive integers. This is the event that the gambler’s 
fortune reaches —a before it reaches +b; it represents ruin for a gambler 
with a dollars playing against an adversary with b dollars, the rule being that 
they play until one or the other runs out of capital. 

The union in (2.1) is countable and disjoint, and for each n the set in the 
union is itself a union of certain of the intervals (1.9). Thus (2.1) is a 
countably infinite disjoint union of intervals, and it is natural to take as its 
probability the sum of the lengths of these constituent intervals. Since the set 
of normal numbers is not a countable disjoint union of intervals, however, 
this extension of the definition of probability would still not cover all the 
interesting sets (events) in (0, 1]. 

It is, in fact, not fruitful to try to predict just which sets probabilistic
analysis will require and then assign probabilities to them in some ad hoc
way. The successful procedure is to develop a general theory that assigns
probabilities at once to the sets of a class so extensive that most of its
members never actually arise in probability theory. That being so, why not
ask for a theory that goes all the way and applies to every set in a space $\Omega$?
In the case of the unit interval, should there not exist a well-defined
probability that the random point $\omega$ lies in $A$, whatever the set $A$ may be?
The answer turns out to be no (see p. 45), and it is necessary to work within
subclasses of the class of all subsets of a space $\Omega$. The classes of the
appropriate kinds (the fields and σ-fields) are defined and studied in this
section. The theory developed here covers the spaces listed above, including
the unit interval, and a great variety of others.


Classes of Sets 


It is necessary to single out for special treatment classes of subsets of a space
$\Omega$, and to be useful, such a class must be closed under various of the
operations of set theory. Once again the unit interval provides an instructive
example.


Example 2.1.* Consider the set $N$ of normal numbers in the form (1.24),
where $s_n(\omega)$ is the sum of the first $n$ Rademacher functions. Since a point $\omega$
lies in $N$ if and only if $\lim_n n^{-1} s_n(\omega) = 0$, $N$ can be put in the form

$$N = \bigcap_{k=1}^{\infty} \bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} [\omega: |n^{-1} s_n(\omega)| < k^{-1}]. \tag{2.2}$$

Indeed, because of the very meaning of union and of intersection, $\omega$ lies in
the set on the right here if and only if for every $k$ there exists an $m$ such that
$|n^{-1} s_n(\omega)| < k^{-1}$ holds for all $n \ge m$, and this is just the definition of
convergence to 0 (with the usual $\epsilon$ replaced by $k^{-1}$ to avoid the formation
of an uncountable intersection). Since $s_n(\omega)$ is constant over each dyadic
interval of rank $n$, the set $[\omega: |n^{-1} s_n(\omega)| < k^{-1}]$ is a finite disjoint union of
intervals. The formula (2.2) shows explicitly how $N$ is constructed in steps
from these simpler sets. ∎


A systematic treatment of the ideas in Section 1 thus requires a class of
sets that contains the intervals and is closed under the formation of countable
unions and intersections. Note that a singleton [A1] $\{x\}$ is a countable
intersection $\bigcap_n (x - n^{-1}, x]$ of intervals. If a class contains all the singletons
and is closed under the formation of arbitrary unions, then of course it
contains all the subsets of $\Omega$. As the theory of this section and the next does
not apply to such extensive classes of sets, attention must be restricted to
countable set-theoretic operations and in some cases even to finite ones.

Consider now a completely arbitrary nonempty space $\Omega$. A class $\mathcal{F}$ of
subsets of $\Omega$ is called a field† if it contains $\Omega$ itself and is closed under the
formation of complements and finite unions:


(i) $\Omega \in \mathcal{F}$;
(ii) $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$;
(iii) $A, B \in \mathcal{F}$ implies $A \cup B \in \mathcal{F}$.


Since $\Omega$ and the empty set $\emptyset$ are complementary, (i) is the same in the
presence of (ii) as the assumption $\emptyset \in \mathcal{F}$. In fact, (i) simply ensures that $\mathcal{F}$
is nonempty: If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ by (ii) and $\Omega = A \cup A^c \in \mathcal{F}$ by (iii).

By DeMorgan's law, $A \cap B = (A^c \cup B^c)^c$ and $A \cup B = (A^c \cap B^c)^c$. If $\mathcal{F}$ is
closed under complementation, therefore, it is closed under the formation of
finite unions if and only if it is closed under the formation of finite intersections.
Thus (iii) can be replaced by the requirement


(iii') $A, B \in \mathcal{F}$ implies $A \cap B \in \mathcal{F}$.


*Many of the examples in the book simply illustrate the concepts at hand, but others contain
definitions and facts needed subsequently.
†The term algebra is often used in place of field.


A class $\mathcal{F}$ of subsets of $\Omega$ is a σ-field if it is a field and if it is also closed
under the formation of countable unions:


(iv) $A_1, A_2, \ldots \in \mathcal{F}$ implies $A_1 \cup A_2 \cup \cdots \in \mathcal{F}$.


By the infinite form of DeMorgan's law, assuming (iv) is the same thing as
assuming


(iv') $A_1, A_2, \ldots \in \mathcal{F}$ implies $A_1 \cap A_2 \cap \cdots \in \mathcal{F}$.


Note that (iv) implies (iii) because one can take $A_1 = A$ and $A_n = B$ for
$n \ge 2$. A field is sometimes called a finitely additive field to stress that it need
not be a σ-field. A set in a given class $\mathcal{F}$ is said to be measurable $\mathcal{F}$ or to be
an $\mathcal{F}$-set. A field or σ-field of subsets of $\Omega$ will sometimes be called a field or
σ-field in $\Omega$.


Example 2.2. Section 1 began with a consideration of the sets (1.2), the
finite disjoint unions of subintervals of $\Omega = (0, 1]$. Augmented by the empty
set, this class is a field $\mathcal{B}_0$: Suppose that $A = (a_1, a_1'] \cup \cdots \cup (a_m, a_m']$,
where the notation is so chosen that $a_1 \le \cdots \le a_m$. If the $(a_i, a_i']$ are
disjoint, then $A^c$ is $(0, a_1] \cup (a_1', a_2] \cup \cdots \cup (a_{m-1}', a_m] \cup (a_m', 1]$ and so lies
in $\mathcal{B}_0$ (some of these intervals may be empty, as $a_i'$ and $a_{i+1}$ may coincide).
If $B = (b_1, b_1'] \cup \cdots \cup (b_n, b_n']$, the $(b_j, b_j']$ again disjoint, then

$$A \cap B = \bigcup_{i=1}^{m} \bigcup_{j=1}^{n} \{(a_i, a_i'] \cap (b_j, b_j']\};$$

each intersection here is again an interval or
else the empty set, and the union is disjoint, and hence $A \cap B$ is in $\mathcal{B}_0$. Thus
$\mathcal{B}_0$ satisfies (i), (ii), and (iii').

Although $\mathcal{B}_0$ is a field, it is not a σ-field: It does not contain the
singletons $\{x\}$, even though each is a countable intersection $\bigcap_n (x - n^{-1}, x]$
of $\mathcal{B}_0$-sets. And $\mathcal{B}_0$ does not contain the set (2.1), a countable union of
intervals that cannot be represented as a finite union of intervals. The set
(2.2) of normal numbers is also outside $\mathcal{B}_0$. ∎
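The closure of this class under complements and finite intersections can be made concrete. The following is a minimal sketch, assuming a set in the class is stored as a sorted list of disjoint pairs `(a, b)` standing for the intervals $(a, b]$; the representation and all function names are illustrative choices, not notation from the text.

```python
from fractions import Fraction as F


def normalize(intervals):
    """Sort, drop empty intervals, and merge abutting or overlapping ones."""
    out = []
    for a, b in sorted((a, b) for a, b in intervals if a < b):
        if out and a <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], b))
        else:
            out.append((a, b))
    return out


def complement(A):
    """Complement in (0, 1] of a normalized union of intervals (a, b]."""
    out, left = [], F(0)
    for a, b in A:
        out.append((left, a))   # the gap to the left of (a, b]
        left = b
    out.append((left, F(1)))
    return normalize(out)


def intersect(A, B):
    """Pairwise intersections (a, b] with (c, d]; empty ones are dropped."""
    return normalize([(max(a, c), min(b, d)) for a, b in A for c, d in B])


def union(A, B):
    """Union obtained from complement and intersection by DeMorgan's law."""
    return complement(intersect(complement(A), complement(B)))


def length(A):
    """Sum of the lengths of the constituent intervals."""
    return sum((b - a for a, b in A), F(0))


A = normalize([(F(0), F(1, 4)), (F(1, 2), F(3, 4))])
B = normalize([(F(1, 8), F(5, 8))])
print(complement(A), intersect(A, B), union(A, B))
```

Every operation returns another finite list of intervals, so no finite sequence of these operations leaves the class; a singleton or the set (2.1) cannot be produced this way.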


The definitions above involve distinctions perhaps most easily made clear 
by a pair of artificial examples. 


Example 2.3. Let $\mathcal{F}$ consist of the finite and the cofinite sets ($A$ being
cofinite if $A^c$ is finite). Then $\mathcal{F}$ is a field. If $\Omega$ is finite, then $\mathcal{F}$ contains all
the subsets of $\Omega$ and hence is a σ-field as well. If $\Omega$ is infinite, however, then
$\mathcal{F}$ is not a σ-field. Indeed, choose in $\Omega$ a set $A$ that is countably infinite and
has infinite complement. (For example, choose a sequence $\omega_1, \omega_2, \ldots$ of
distinct points in $\Omega$ and take $A = \{\omega_2, \omega_4, \ldots\}$.) Then $A \notin \mathcal{F}$, even though
$A$ is the union, necessarily countable, of the singletons it contains and each
singleton is in $\mathcal{F}$. This shows that the definition of σ-field is indeed more
restrictive than that of field. ∎


Example 2.4. Let $\mathcal{F}$ consist of the countable and the cocountable sets ($A$
being cocountable if $A^c$ is countable). Then $\mathcal{F}$ is a σ-field. If $\Omega$ is
uncountable, then it contains a set $A$ such that $A$ and $A^c$ are both
uncountable.† Such a set is not in $\mathcal{F}$, which shows that even a σ-field may
not contain all the subsets of $\Omega$; furthermore, this set is the union (uncountable)
of the singletons it contains and each singleton is in $\mathcal{F}$, which shows that
a σ-field may not be closed under the formation of arbitrary unions. ∎


The largest σ-field in $\Omega$ is the power class $2^{\Omega}$, consisting of all the subsets
of $\Omega$; the smallest σ-field consists only of the empty set and $\Omega$ itself.

The elementary facts about fields and σ-fields are easy to prove: If $\mathcal{F}$ is a
field, then $A, B \in \mathcal{F}$ implies $A - B = A \cap B^c \in \mathcal{F}$ and $A \,\triangle\, B = (A - B) \cup
(B - A) \in \mathcal{F}$. Further, it follows by induction on $n$ that $A_1, \ldots, A_n \in \mathcal{F}$
implies $A_1 \cup \cdots \cup A_n \in \mathcal{F}$ and $A_1 \cap \cdots \cap A_n \in \mathcal{F}$.

A field is closed under the finite set-theoretic operations, and a σ-field is
closed also under the countable ones. The analysis of a probability problem
usually begins with the sets of some rather small class $\mathcal{A}$, such as the class of
subintervals of $(0, 1]$. As in Example 2.1, probabilistically natural constructions
involving finite and countable operations can then lead to sets outside
the initial class $\mathcal{A}$. This leads one to consider a class of sets that (i) contains
$\mathcal{A}$ and (ii) is a σ-field; it is natural and convenient, as it turns out, to
consider a class that has these two properties and that in addition (iii) is in a
certain sense as small as possible. As will be shown, this class is the
intersection of all the σ-fields containing $\mathcal{A}$; it is called the σ-field generated by
$\mathcal{A}$ and is denoted by $\sigma(\mathcal{A})$.

There do exist σ-fields containing $\mathcal{A}$, the class of all subsets of $\Omega$ being
one. Moreover, a completely arbitrary intersection of σ-fields (however many
of them there may be) is itself a σ-field: Suppose that $\mathcal{F} = \bigcap_\theta \mathcal{F}_\theta$, where $\theta$
ranges over an arbitrary index set and each $\mathcal{F}_\theta$ is a σ-field. Then $\Omega \in \mathcal{F}_\theta$
for all $\theta$, so that $\Omega \in \mathcal{F}$. And $A \in \mathcal{F}$ implies for each $\theta$ that $A \in \mathcal{F}_\theta$ and
hence $A^c \in \mathcal{F}_\theta$, so that $A^c \in \mathcal{F}$. If $A_n \in \mathcal{F}$ for each $n$, then $A_n \in \mathcal{F}_\theta$ for
each $n$ and $\theta$, so that $\bigcup_n A_n$ lies in each $\mathcal{F}_\theta$ and hence in $\mathcal{F}$.

Thus the intersection in the definition of $\sigma(\mathcal{A})$ is indeed a σ-field
containing $\mathcal{A}$. It is as small as possible, in the sense that it is contained in
every σ-field that contains $\mathcal{A}$: if $\mathcal{A} \subset \mathcal{G}$ and $\mathcal{G}$ is a σ-field, then $\mathcal{G}$ is one of
the σ-fields in the intersection defining $\sigma(\mathcal{A})$, so that $\sigma(\mathcal{A}) \subset \mathcal{G}$. Thus
$\sigma(\mathcal{A})$ has these three properties:


(i) $\mathcal{A} \subset \sigma(\mathcal{A})$;
(ii) $\sigma(\mathcal{A})$ is a σ-field;
(iii) if $\mathcal{A} \subset \mathcal{G}$ and $\mathcal{G}$ is a σ-field, then $\sigma(\mathcal{A}) \subset \mathcal{G}$.


†If $\Omega$ is the unit interval, for example, take $A = (0, \tfrac12]$, say. To show that the general
uncountable $\Omega$ contains such an $A$ requires the axiom of choice [A8]. As a matter of fact, to
prove the existence of the sequence alluded to in Example 2.3 requires a form of the axiom of
choice, as does even something so apparently down-to-earth as proving that a countable union of
negligible sets is negligible. Most of us use the axiom of choice completely unaware of the fact.
Even Borel and Lebesgue did; see WAGON, pp. 217 ff.


The importance of o-fields will gradually become clear. 


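Example 2.5. If $\mathcal{F}$ is a σ-field, then obviously $\sigma(\mathcal{F}) = \mathcal{F}$. If $\mathcal{A}$ consists
of the singletons, then $\sigma(\mathcal{A})$ is the σ-field in Example 2.4. If $\mathcal{A}$ is empty or
$\mathcal{A} = \{\emptyset\}$ or $\mathcal{A} = \{\Omega\}$, then $\sigma(\mathcal{A}) = \{\emptyset, \Omega\}$. If $\mathcal{A} \subset \mathcal{A}'$, then $\sigma(\mathcal{A}) \subset \sigma(\mathcal{A}')$.
If $\mathcal{A} \subset \mathcal{A}' \subset \sigma(\mathcal{A})$, then $\sigma(\mathcal{A}) = \sigma(\mathcal{A}')$. ∎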
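When $\Omega$ is finite, every field is a σ-field, and the generated σ-field can be computed by brute force. The sketch below assumes a finite space and closes the initial class under complements and pairwise unions until nothing new appears; the function name and the four-point space are hypothetical choices for illustration.

```python
from itertools import combinations


def generated_sigma_field(omega, classes):
    """Smallest class of subsets of the finite set `omega` that contains
    `classes`, the empty set and omega, and is closed under complement
    and union. For finite omega this is the generated sigma-field."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(a) for a in classes}
    while True:
        new = {omega - a for a in field}                      # complements
        new |= {a | b for a, b in combinations(field, 2)}     # pairwise unions
        if new <= field:                                      # nothing new: closed
            return field
        field |= new


omega = {1, 2, 3, 4}
print(sorted(map(sorted, generated_sigma_field(omega, [{1}]))))
print(len(generated_sigma_field(omega, [{w} for w in omega])))
```

With the single generator `{1}` the result has four sets; with all the singletons it is the whole power class, in agreement with the cases listed in Example 2.5 for a finite space.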


Example 2.6. Let $\mathcal{I}$ be the class of subintervals of $\Omega = (0, 1]$, and define
$\mathcal{B} = \sigma(\mathcal{I})$. The elements of $\mathcal{B}$ are called the Borel sets of the unit interval.
The field $\mathcal{B}_0$ of Example 2.2 satisfies $\mathcal{I} \subset \mathcal{B}_0 \subset \mathcal{B}$, and hence $\sigma(\mathcal{B}_0) = \mathcal{B}$.

Since $\mathcal{B}$ contains the intervals and is a σ-field, repeated finite and
countable set-theoretic operations starting from intervals will never lead
outside $\mathcal{B}$. Thus $\mathcal{B}$ contains the set (2.2) of normal numbers. It also contains
for example the open sets in $(0, 1]$: If $G$ is open and $x \in G$, then there exist
rationals $a_x$ and $b_x$ such that $x \in (a_x, b_x] \subset G$. But then $G = \bigcup_{x \in G} (a_x, b_x]$;
since there are only countably many intervals with rational endpoints, $G$ is a
countable union of elements of $\mathcal{I}$ and hence lies in $\mathcal{B}$.

In fact, $\mathcal{B}$ contains all the subsets of $(0, 1]$ actually encountered in
ordinary analysis and probability. It is large enough for all "practical"
purposes. It does not contain every subset of the unit interval, however; see
the end of Section 3 (p. 45). The class $\mathcal{B}$ will play a fundamental role in all
that follows. ∎


Probability Measures 


A set function is a real-valued function defined on some class of subsets of
$\Omega$. A set function $P$ on a field $\mathcal{F}$ is a probability measure if it satisfies these
conditions:


(i) $0 \le P(A) \le 1$ for $A \in \mathcal{F}$;

(ii) $P(\emptyset) = 0$, $P(\Omega) = 1$;

(iii) if $A_1, A_2, \ldots$ is a disjoint sequence of $\mathcal{F}$-sets and if $\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$,
then†

$$P\left( \bigcup_{k=1}^{\infty} A_k \right) = \sum_{k=1}^{\infty} P(A_k). \tag{2.3}$$


†As the left side of (2.3) is invariant under permutations of the $A_k$, the same must be true of the
right side. But in fact, according to Dirichlet's theorem [A26], a nonnegative series has the same
value whatever order the terms are summed in.


SECTION 2. PROBABILITY MEASURES 23 


The condition imposed on the set function $P$ by (iii) is called countable
additivity. Note that, since $\mathcal{F}$ is a field but perhaps not a σ-field, it is
necessary in (iii) to assume that $\bigcup_{k=1}^{\infty} A_k$ lies in $\mathcal{F}$. If $A_1, \ldots, A_n$ are disjoint
$\mathcal{F}$-sets, then $\bigcup_{k=1}^{n} A_k$ is also in $\mathcal{F}$ and (2.3) with $A_{n+1} = A_{n+2} = \cdots = \emptyset$
gives

$$P\left( \bigcup_{k=1}^{n} A_k \right) = \sum_{k=1}^{n} P(A_k). \tag{2.4}$$

The condition that (2.4) holds for disjoint sets is finite additivity; it is a
consequence of countable additivity. It follows by induction on $n$ that $P$ is
finitely additive if (2.4) holds for $n = 2$, that is, if $P(A \cup B) = P(A) + P(B)$ for
disjoint $\mathcal{F}$-sets $A$ and $B$.

The conditions above are redundant, because (i) can be replaced by
$P(A) \ge 0$ and (ii) by $P(\Omega) = 1$. Indeed, the weakened forms (together with
(iii)) imply that $P(\Omega) = P(\Omega) + P(\emptyset) + P(\emptyset) + \cdots$, so that $P(\emptyset) = 0$, and
$1 = P(\Omega) = P(A) + P(A^c)$, so that $P(A) \le 1$.


Example 2.7. Consider as in Example 2.2 the field $\mathcal{B}_0$ of finite disjoint
unions of subintervals of $\Omega = (0, 1]$. The definition (1.3) assigns to each
$\mathcal{B}_0$-set a number (the sum of the lengths of the constituent intervals) and
hence specifies a set function $P$ on $\mathcal{B}_0$. Extended inductively, (1.4) says that
$P$ is finitely additive. In Section 1 this property was deduced from the
additivity of the Riemann integral (see (1.5)). In Theorem 2.2 below, the
finite additivity of $P$ will be proved from first principles, and it will be shown
that $P$ is, in fact, countably additive, that is, a probability measure on the field
$\mathcal{B}_0$. The hard part of the argument is in the proof of Theorem 1.3, already
done; the rest will be easy. ∎


If $\mathcal{F}$ is a σ-field in $\Omega$ and $P$ is a probability measure on $\mathcal{F}$, the triple
$(\Omega, \mathcal{F}, P)$ is called a probability measure space, or simply a probability space.
A support of $P$ is any $\mathcal{F}$-set $A$ for which $P(A) = 1$.


Example 2.8. Let $\mathcal{F}$ be the σ-field of all subsets of a countable space $\Omega$,
and let $p(\omega)$ be a nonnegative function on $\Omega$. Suppose that $\sum_{\omega \in \Omega} p(\omega) = 1$,
and define $P(A) = \sum_{\omega \in A} p(\omega)$; since $p(\omega) \ge 0$, the order of summation is
irrelevant by Dirichlet's theorem [A26]. Suppose that $A = \bigcup_{i=1}^{\infty} A_i$, where
the $A_i$ are disjoint, and let $\omega_{i1}, \omega_{i2}, \ldots$ be the points in $A_i$. By the theorem
on nonnegative double series [A27], $P(A) = \sum_{ij} p(\omega_{ij}) = \sum_i \sum_j p(\omega_{ij}) =
\sum_i P(A_i)$, and so $P$ is countably additive. This $(\Omega, \mathcal{F}, P)$ is a discrete
probability space. It is the formal basis for discrete probability theory. ∎


Example 2.9. Now consider a probability measure $P$ on an arbitrary
σ-field $\mathcal{F}$ in an arbitrary space $\Omega$; $P$ is a discrete probability measure if there
exist finitely or countably many points $\omega_k$ and masses $m_k$ such that $P(A) =
\sum_{\omega_k \in A} m_k$ for $A$ in $\mathcal{F}$. Here $P$ is discrete, but the space itself may not be. In
terms of indicator functions, the defining condition is $P(A) = \sum_k m_k I_A(\omega_k)$
for $A \in \mathcal{F}$. If the set $\{\omega_1, \omega_2, \ldots\}$ lies in $\mathcal{F}$, then it is a support of $P$.

If there is just one of these points, say $\omega_0$, with mass $m_0 = 1$, then $P$ is a
unit mass at $\omega_0$. In this case $P(A) = I_A(\omega_0)$ for $A \in \mathcal{F}$. ∎


Suppose that $P$ is a probability measure on a field $\mathcal{F}$, and that $A, B \in \mathcal{F}$
and $A \subset B$. Since $P(A) + P(B - A) = P(B)$, $P$ is monotone:

$$P(A) \le P(B) \quad \text{if } A \subset B. \tag{2.5}$$

It follows further that $P(B - A) = P(B) - P(A)$, and as a special case,

$$P(A^c) = 1 - P(A). \tag{2.6}$$

Other formulas familiar from the discrete theory are easily proved. For
example,

$$P(A) + P(B) = P(A \cup B) + P(A \cap B), \tag{2.7}$$

the common value of the two sides being $P(A \cap B^c) + 2P(A \cap B) + P(A^c \cap B)$.
Subtraction gives

$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{2.8}$$


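This is the case $n = 2$ of the general inclusion-exclusion formula:

$$P\left( \bigcup_{k=1}^{n} A_k \right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) + \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n). \tag{2.9}$$

To deduce this inductively from (2.8), note that (2.8) gives

$$P\left( \bigcup_{k=1}^{n+1} A_k \right) = P\left( \bigcup_{k=1}^{n} A_k \right) + P(A_{n+1}) - P\left( \bigcup_{k=1}^{n} (A_k \cap A_{n+1}) \right).$$

Applying (2.9) to the first and third terms on the right gives (2.9) with $n + 1$
in place of $n$.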
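The inclusion-exclusion formula is easy to check on a small discrete space. The following sketch assumes a finite sample space with point masses stored in a dictionary `p` and events stored as frozensets; these names and the six-point uniform example are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def prob(event, p):
    """P(A) as the sum of the point masses p[w] over w in A."""
    return sum((p[w] for w in event), Fraction(0))


def inclusion_exclusion(events, p):
    """Right side of the inclusion-exclusion formula: alternating sum of the
    probabilities of all r-fold intersections, r = 1, ..., n."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for group in combinations(events, r):
            total += sign * prob(frozenset.intersection(*group), p)
    return total


p = {w: Fraction(1, 6) for w in range(6)}          # uniform masses on six points
events = [frozenset({0, 1, 2}), frozenset({2, 3}), frozenset({0, 3, 4})]
union = frozenset().union(*events)
print(prob(union, p), inclusion_exclusion(events, p))
```

Both printed values agree, as the formula requires; the computation uses only finite additivity.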

If $B_1 = A_1$ and $B_k = A_k \cap A_1^c \cap \cdots \cap A_{k-1}^c$, then the $B_k$ are disjoint and
$\bigcup_{k=1}^{n} A_k = \bigcup_{k=1}^{n} B_k$, so that $P(\bigcup_{k=1}^{n} A_k) = \sum_{k=1}^{n} P(B_k)$. Since $P(B_k) \le
P(A_k)$ by monotonicity, this establishes the finite subadditivity of $P$:

$$P\left( \bigcup_{k=1}^{n} A_k \right) \le \sum_{k=1}^{n} P(A_k). \tag{2.10}$$

Here, of course, the $A_k$ need not be disjoint. Sometimes (2.10) is called
Boole's inequality.

In these formulas all the sets are naturally assumed to lie in the field $\mathcal{F}$.
The derivations above involve only the finite additivity of $P$. Countable
additivity gives further properties:


Theorem 2.1. Let $P$ be a probability measure on a field $\mathcal{F}$.


(i) Continuity from below: If $A_n$ and $A$ lie in $\mathcal{F}$ and† $A_n \uparrow A$, then
$P(A_n) \uparrow P(A)$.

(ii) Continuity from above: If $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \downarrow A$, then
$P(A_n) \downarrow P(A)$.

(iii) Countable subadditivity: If $A_1, A_2, \ldots$ and $\bigcup_{k=1}^{\infty} A_k$ lie in $\mathcal{F}$ (the $A_k$
need not be disjoint), then

$$P\left( \bigcup_{k=1}^{\infty} A_k \right) \le \sum_{k=1}^{\infty} P(A_k). \tag{2.11}$$


Proof. For (i), put $B_1 = A_1$ and $B_k = A_k - A_{k-1}$. Then the $B_k$ are
disjoint, $A = \bigcup_{k=1}^{\infty} B_k$, and $A_n = \bigcup_{k=1}^{n} B_k$, so that by countable and finite
additivity, $P(A) = \sum_{k=1}^{\infty} P(B_k) = \lim_n \sum_{k=1}^{n} P(B_k) = \lim_n P(A_n)$. For (ii), observe
that $A_n \downarrow A$ implies $A_n^c \uparrow A^c$, so that $1 - P(A_n) \uparrow 1 - P(A)$.

As for (iii), increase the right side of (2.10) to $\sum_{k=1}^{\infty} P(A_k)$ and then apply
part (i) to the left side. ∎


Example 2.10. In the presence of finite additivity, a special case of (ii)
implies countable additivity. If $P$ is a finitely additive probability measure on
the field $\mathcal{F}$, and if $A_n \downarrow \emptyset$ for sets $A_n$ in $\mathcal{F}$ implies $P(A_n) \downarrow 0$, then $P$ is
countably additive. Indeed, if $B = \bigcup_n B_n$ for disjoint sets $B_n$ ($B$ and the $B_n$ in
$\mathcal{F}$), then $C_n = \bigcup_{k>n} B_k = B - \bigcup_{k \le n} B_k$ lies in the field $\mathcal{F}$, and $C_n \downarrow \emptyset$. The
hypothesis, together with finite additivity, gives $P(B) - \sum_{k=1}^{n} P(B_k) =
P(C_n) \to 0$, and hence $P(B) = \sum_{k=1}^{\infty} P(B_k)$. ∎


Lebesgue Measure on the Unit Interval 


The definition (1.3) specifies a set function on the field $\mathcal{B}_0$ of finite disjoint
unions of intervals in $(0, 1]$; the problem is to prove $P$ countably additive. It
will be convenient to change notation from $P$ to $\lambda$, and to denote by $\mathcal{I}$ the
class of subintervals $(a, b]$ of $(0, 1]$; then $\lambda(I) = |I| = b - a$ is ordinary length.
Regard $\emptyset$ as an element of $\mathcal{I}$ of length 0. If $A = \bigcup_{i=1}^{n} I_i$, the $I_i$ being
disjoint $\mathcal{I}$-sets, the definition (1.3) in the new notation is

$$\lambda(A) = \sum_{i=1}^{n} \lambda(I_i) = \sum_{i=1}^{n} |I_i|. \tag{2.12}$$

†For the notation, see [A4] and [A10].

As pointed out in Section 1, there is a question of uniqueness here, because
$A$ will have other representations as a finite disjoint union $\bigcup_{j=1}^{m} J_j$ of $\mathcal{I}$-sets.
But $\mathcal{I}$ is closed under the formation of finite intersections, and so the finite
form of Theorem 1.3(iii) gives

$$\sum_{i=1}^{n} |I_i| = \sum_{i=1}^{n} \sum_{j=1}^{m} |I_i \cap J_j| = \sum_{j=1}^{m} |J_j|. \tag{2.13}$$

(Some of the $I_i \cap J_j$ may be empty, but the corresponding lengths are then 0.)
The definition is indeed consistent.

Thus (2.12) defines a set function $\lambda$ on $\mathcal{B}_0$, a set function called Lebesgue
measure.


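Theorem 2.2. Lebesgue measure $\lambda$ is a (countably additive) probability
measure on the field $\mathcal{B}_0$.


Proof. Suppose that $A = \bigcup_{k=1}^{\infty} A_k$, where $A$ and the $A_k$ are $\mathcal{B}_0$-sets
and the $A_k$ are disjoint. Then $A = \bigcup_{i=1}^{n} I_i$ and $A_k = \bigcup_{j=1}^{m_k} J_{kj}$ are disjoint
unions of $\mathcal{I}$-sets, and (2.12) and Theorem 1.3(iii) give

$$\lambda(A) = \sum_{i=1}^{n} |I_i| = \sum_{i=1}^{n} \sum_{k=1}^{\infty} \sum_{j=1}^{m_k} |I_i \cap J_{kj}| = \sum_{k=1}^{\infty} \sum_{j=1}^{m_k} |J_{kj}| = \sum_{k=1}^{\infty} \lambda(A_k). \tag{2.14}$$

∎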


In Section 3 it is shown how to extend $\lambda$ from $\mathcal{B}_0$ to the larger class
$\mathcal{B} = \sigma(\mathcal{B}_0)$ of Borel sets in $(0, 1]$. This will complete the construction of $\lambda$ as
a probability measure (countably additive, that is) on $\mathcal{B}$, and the construction
is fundamental to all that follows. For example, the set $N$ of normal numbers
lies in $\mathcal{B}$ (Example 2.6), and it will turn out that $\lambda(N) = 1$, as probabilistic
intuition requires. (In Chapter 2, $\lambda$ will be defined for sets outside the unit
interval as well.)

It is well to pause here and consider just what is involved in the construction
of Lebesgue measure on the Borel sets of the unit interval. That length
defines a finitely additive set function on the class $\mathcal{I}$ of intervals in $(0, 1]$ is a
consequence of Theorem 1.3 for the case of only finitely many intervals and
thus involves only the most elementary properties of the real number system.
But proving countable additivity on $\mathcal{I}$ requires the deeper property of
compactness (the Heine-Borel theorem). Once $\lambda$ has been proved countably
additive on $\mathcal{I}$, extending it to $\mathcal{B}_0$ by the definition (2.12) presents no real
difficulty: the arguments involving (2.13) and (2.14) are easy. Difficulties again
arise, however, in the further extension of $\lambda$ from $\mathcal{B}_0$ to $\mathcal{B} = \sigma(\mathcal{B}_0)$, and
here new ideas are again required. These ideas are the subject of Section 3,
where it is shown that any probability measure on any field can be extended
to the generated σ-field.


Sequence Space* 


Let $S$ be a finite set of points regarded as the possible outcomes of a simple
observation or experiment. For tossing a coin, $S$ can be $\{H, T\}$ or $\{0, 1\}$; for
rolling a die, $S = \{1, \ldots, 6\}$; in information theory, $S$ plays the role of a finite
alphabet. Let $\Omega = S^{\infty}$ be the space of all infinite sequences

$$\omega = (z_1(\omega), z_2(\omega), \ldots) \tag{2.15}$$

of elements of $S$: $z_k(\omega) \in S$ for all $\omega \in S^{\infty}$ and $k \ge 1$. The sequence (2.15)
can be viewed as the result of repeating infinitely often the simple experiment
represented by $S$. For $S = \{0, 1\}$, the space $S^{\infty}$ is closely related to the
unit interval; compare (1.8) and (2.15).

*The ideas that follow are basic to probability theory and are used further on, in particular in
Section 24 and (in more elaborate form) Section 36. On a first reading, however, one might
prefer to skip to Section 3 and return to this topic as the need arises.

The space $S^{\infty}$ is an infinite-dimensional Cartesian product. Each $z_k(\cdot)$ is a
mapping of $S^{\infty}$ onto $S$; these are the coordinate functions, or the natural
projections. Let $S^n = S \times \cdots \times S$ be the Cartesian product of $n$ copies of $S$;
it consists of the $n$-long sequences $(u_1, \ldots, u_n)$ of elements of $S$. For such a
sequence, the set

$$[\omega: (z_1(\omega), \ldots, z_n(\omega)) = (u_1, \ldots, u_n)] \tag{2.16}$$

represents the event that the first $n$ repetitions of the experiment give the
outcomes $u_1, \ldots, u_n$ in sequence. A cylinder of rank $n$ is a set of the form

$$A = [\omega: (z_1(\omega), \ldots, z_n(\omega)) \in H], \tag{2.17}$$

where $H \subset S^n$. Note that $A$ is nonempty if $H$ is. If $H$ is a singleton in $S^n$,
(2.17) reduces to (2.16), which can be called a thin cylinder.

Let $\mathcal{C}_0$ be the class of cylinders of all ranks. Then $\mathcal{C}_0$ is a field: $S^{\infty}$ and
the empty set have the form (2.17) for $H = S^n$ and for $H = \emptyset$. If $H$ is
replaced by $S^n - H$, then (2.17) goes into its complement, and hence $\mathcal{C}_0$ is
closed under complementation. As for unions, consider (2.17) together with

$$B = [\omega: (z_1(\omega), \ldots, z_m(\omega)) \in I], \tag{2.18}$$

a cylinder of rank $m$. Suppose that $n \le m$ (symmetry); if $H'$ consists of the
sequences $(u_1, \ldots, u_m)$ in $S^m$ for which the truncated sequence $(u_1, \ldots, u_n)$
lies in $H$, then (2.17) has the alternative form

$$A = [\omega: (z_1(\omega), \ldots, z_m(\omega)) \in H']. \tag{2.19}$$

Since it is now clear that

$$A \cup B = [\omega: (z_1(\omega), \ldots, z_m(\omega)) \in H' \cup I] \tag{2.20}$$

is also a cylinder, $\mathcal{C}_0$ is closed under the formation of finite unions and hence
is indeed a field.

Let $p_u$, $u \in S$, be probabilities on $S$, nonnegative and summing to 1.
Define a set function $P$ on $\mathcal{C}_0$ (it will turn out to be a probability measure)
in this way: For a cylinder $A$ given by (2.17), take

(2.21) $P(A) = \sum_{H} p_{u_1} \cdots p_{u_n}$,

the sum extending over all the sequences $(u_1, \ldots, u_n)$ in $H$. As a special case,

(2.22) $P[\omega: (z_1(\omega), \ldots, z_n(\omega)) = (u_1, \ldots, u_n)] = p_{u_1} \cdots p_{u_n}$.

Because of the products on the right in (2.21) and (2.22), $P$ is called product
measure; it provides a model for an infinite sequence of independent repetitions
of the simple experiment represented by the probabilities $p_u$ on $S$. In
the case where $S = \{0,1\}$ and $p_0 = p_1 = \frac{1}{2}$, it is a model for independent
tosses of a fair coin, an alternative to the model used in Section 1.

The definition (2.21) presents a consistency problem, since the cylinder $A$
will have other representations. Suppose that $A$ is also given by (2.19). If
$n = m$, then $H$ and $H'$ must coincide, and there is nothing to prove. Suppose
then (symmetry) that $n < m$. Then $H'$ must consist of those $(u_1, \ldots, u_m)$ in
$S^m$ for which $(u_1, \ldots, u_n)$ lies in $H$: $H' = H \times S^{m-n}$. But then

(2.23) $\sum_{H'} p_{u_1} \cdots p_{u_n} p_{u_{n+1}} \cdots p_{u_m} = \sum_{H} p_{u_1} \cdots p_{u_n} \sum_{S^{m-n}} p_{u_{n+1}} \cdots p_{u_m} = \sum_{H} p_{u_1} \cdots p_{u_n}$.

The definition (2.21) is therefore consistent. And finite additivity is now easy:
Suppose that $A$ and $B$ are disjoint cylinders given by (2.17) and (2.18).
Suppose that $n \le m$, and put $A$ in the form (2.19). Since $A$ and $B$ are
disjoint, $H'$ and $I$ must be disjoint as well, and by (2.20),

(2.24) $P(A \cup B) = \sum_{H' \cup I} p_{u_1} \cdots p_{u_m} = P(A) + P(B)$.

Taking $H = S^n$ in (2.21) shows that $P(S^\infty) = 1$. Therefore, (2.21) defines a
finitely additive probability measure on the field $\mathcal{C}_0$.
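
The definition (2.21), the consistency relation (2.23), and the additivity relation (2.24) can be checked by direct enumeration for a small alphabet. The following Python sketch is only an illustration: the function names `cylinder_prob` and `lift`, the biased-coin probabilities, and the particular base sets are assumptions chosen for the example, not part of the text.

```python
from fractions import Fraction
from itertools import product

def cylinder_prob(H, p):
    """P(A) as in (2.21): sum over (u_1,...,u_n) in H of p[u_1]*...*p[u_n]."""
    total = Fraction(0)
    for seq in H:
        term = Fraction(1)
        for u in seq:
            term *= p[u]
        total += term
    return total

def lift(H, S, m):
    """H' = H x S^(m-n): the m-long sequences whose truncation lies in H."""
    lifted = set()
    for seq in H:
        for tail in product(S, repeat=m - len(seq)):
            lifted.add(seq + tail)
    return lifted

S = (0, 1)
p = {0: Fraction(1, 3), 1: Fraction(2, 3)}  # illustrative probabilities on S
H = {(0, 1), (1, 1)}      # rank-2 base set: the second coordinate is 1
I = {(0, 0, 0)}           # rank-3 base set, disjoint from the lift of H
H_lift = lift(H, S, 3)    # the same cylinder, represented at rank 3

# Consistency (2.23): both representations give the same probability.
print(cylinder_prob(H, p), cylinder_prob(H_lift, p))
# Finite additivity (2.24) for the disjoint cylinders with bases H' and I.
print(cylinder_prob(H_lift | I, p) == cylinder_prob(H, p) + cylinder_prob(I, p))
```

Exact rational arithmetic is used so that the equalities are tested without rounding error.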

Now, $P$ is countably additive on $\mathcal{C}_0$, but this requires no further argument,
because of the following completely general result.


Theorem 2.3. Every finitely additive probability measure on the field $\mathcal{C}_0$ of
cylinders in $S^\infty$ is in fact countably additive.


The proof depends on this fundamental fact:


Lemma. If $A_n \downarrow A$, where the $A_n$ are nonempty cylinders, then $A$ is
nonempty.


PROOF OF THEOREM 2.3. Assume that the lemma is true, and apply
Example 2.10 to the measure $P$ in question: If $A_n \downarrow \emptyset$ for sets in $\mathcal{C}_0$
(cylinders) but $P(A_n)$ does not converge to 0, then $P(A_n) \ge \epsilon > 0$ for some
$\epsilon$. But then the $A_n$ are nonempty, which by the lemma makes $A_n \downarrow \emptyset$
impossible. ∎


PROOF OF THE LEMMA.† Suppose that $A_t$ is a cylinder of rank $m_t$, say

(2.25) $A_t = [\omega: (z_1(\omega), \ldots, z_{m_t}(\omega)) \in H_t]$,

where $H_t \subset S^{m_t}$. Choose a point $\omega_n$ in $A_n$, which is nonempty by assumption.
Write the components of the sequences in a square array:

(2.26)
$z_1(\omega_1) \quad z_1(\omega_2) \quad z_1(\omega_3) \quad \cdots$
$z_2(\omega_1) \quad z_2(\omega_2) \quad z_2(\omega_3) \quad \cdots$
$\vdots$

The $n$th column of the array gives the components of $\omega_n$.

Now argue by a modification of the diagonal method [A14]. Since $S$ is
finite, some element $u_1$ of $S$ appears infinitely often in the first row of (2.26):
for an increasing sequence $\{n_{1,k}\}$ of integers, $z_1(\omega_{n_{1,k}}) = u_1$ for all $k$. By the
same reasoning, there exist an increasing subsequence $\{n_{2,k}\}$ of $\{n_{1,k}\}$ and an
element $u_2$ of $S$ such that $z_2(\omega_{n_{2,k}}) = u_2$ for all $k$. Continue. If $n_k = n_{k,k}$,
then $z_r(\omega_{n_k}) = u_r$ for $k \ge r$, and hence $(z_1(\omega_{n_k}), \ldots, z_r(\omega_{n_k})) = (u_1, \ldots, u_r)$
for $k \ge r$.

Let $\omega^\circ$ be the element of $S^\infty$ with components $u_r$: $\omega^\circ = (u_1, u_2, \ldots) =
(z_1(\omega^\circ), z_2(\omega^\circ), \ldots)$. Let $t$ be arbitrary. If $k \ge t$, then ($n_k$ is increasing) $n_k \ge t$
and hence $\omega_{n_k} \in A_{n_k} \subset A_t$. It follows by (2.25) that, for $k \ge t$, $H_t$ contains the
point $(z_1(\omega_{n_k}), \ldots, z_{m_t}(\omega_{n_k}))$ of $S^{m_t}$. But for $k \ge m_t$, this point is identical
with $(z_1(\omega^\circ), \ldots, z_{m_t}(\omega^\circ))$, which therefore lies in $H_t$. Thus $\omega^\circ$ is a point
common to all the $A_t$. ∎

†The lemma is a special case of Tychonov's theorem: If $S$ is given the discrete topology, the
topological product $S^\infty$ is compact (and the cylinders are closed).


Let $\mathcal{C}$ be the $\sigma$-field in $S^\infty$ generated by $\mathcal{C}_0$. By the general theory of the
next section, the probability measure $P$ defined on $\mathcal{C}_0$ by (2.21) extends to
$\mathcal{C}$. The term product measure, properly speaking, applies to the extended $P$.
Thus $(S^\infty, \mathcal{C}, P)$ is a probability space, one important in ergodic theory
(Section 24).


Suppose that $S = \{0,1\}$ and $p_0 = p_1 = \frac{1}{2}$. In this case, $(S^\infty, \mathcal{C}, P)$ is closely related to
$((0,1], \mathcal{B}, \lambda)$, although there are essential differences. The sequence (2.15) can end in
0's, but (1.8) cannot. Thin cylinders are like dyadic intervals, but the sets in $\mathcal{C}_0$ (the
cylinders) correspond to the finite disjoint unions of intervals with dyadic endpoints, a
field somewhat smaller than $\mathcal{B}_0$. While nonempty sets in $\mathcal{B}_0$ (for example, $(\frac{1}{2}, \frac{1}{2} + 2^{-n}]$)
can contract to the empty set, nonempty sets in $\mathcal{C}_0$ cannot. The lemma above
plays here the role the Heine-Borel theorem plays in the proof of Theorem 1.3. The
product probability measure constructed here on $\mathcal{C}_0$ (in the case $S = \{0,1\}$, $p_0 = p_1 = \frac{1}{2}$,
that is) is analogous to Lebesgue measure on $\mathcal{B}_0$. But a finitely additive
probability measure on $\mathcal{B}_0$ can fail to be countably additive,† which cannot happen
in $\mathcal{C}_0$.


Constructing $\sigma$-Fields*


The $\sigma$-field $\sigma(\mathcal{A})$ generated by $\mathcal{A}$ was defined from above or from the outside, so to
speak, by intersecting all the $\sigma$-fields that contain $\mathcal{A}$ (including the $\sigma$-field consisting
of all the subsets of $\Omega$). Can $\sigma(\mathcal{A})$ somehow be constructed from the inside by
repeated finite and countable set-theoretic operations starting with sets in $\mathcal{A}$?

For any class $\mathcal{H}$ of sets in $\Omega$ let $\mathcal{H}^*$ consist of the sets in $\mathcal{H}$, the complements
of sets in $\mathcal{H}$, and the finite and countable unions of sets in $\mathcal{H}$. Given a class $\mathcal{A}$, put
$\mathcal{A}_0 = \mathcal{A}$ and define $\mathcal{A}_1, \mathcal{A}_2, \ldots$ inductively by

(2.27) $\mathcal{A}_n = \mathcal{A}_{n-1}^*$.

That each $\mathcal{A}_n$ is contained in $\sigma(\mathcal{A})$ follows by induction. One might hope that
$\mathcal{A}_n = \sigma(\mathcal{A})$ for some $n$, or at least that $\bigcup_{n=0}^\infty \mathcal{A}_n = \sigma(\mathcal{A})$. But this process applied to
the class of intervals fails to account for all the Borel sets.
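
On a finite space the iteration (2.27) can be carried out explicitly, and there it does stabilize, in contrast to the case of intervals treated next. The Python sketch below is an illustration under that finiteness assumption; the names `union_closure`, `star`, and `generate`, and the four-point example, are hypothetical and chosen only for this example.

```python
def union_closure(H):
    """All finite unions of sets in H (computed by repeated pairwise unions)."""
    closed = set(H)
    while True:
        new = {A | B for A in closed for B in closed} - closed
        if not new:
            return closed
        closed |= new

def star(H, omega):
    """One step H -> H*: sets in H, complements of sets in H, unions of sets in H."""
    H = set(H)
    return union_closure(H) | {omega - A for A in H}

def generate(A0, omega):
    """Iterate (2.27) until the class stops growing; return the list of levels."""
    levels = [set(A0)]
    while True:
        nxt = star(levels[-1], omega)
        if nxt == levels[-1]:
            return levels
        levels.append(nxt)

omega = frozenset({1, 2, 3, 4})
A0 = {frozenset({1}), frozenset({2})}
levels = generate(A0, omega)
print([len(level) for level in levels])  # sizes of A_0, A_1, A_2, A_3
```

For this example the final level has 8 sets, the unions of the atoms {1}, {2}, {3, 4}, which is the generated $\sigma$-field.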

Let $\mathcal{I}_0$ consist of the empty set and the intervals in $\Omega = (0,1]$ with rational
endpoints, and define $\mathcal{I}_n = \mathcal{I}_{n-1}^*$ for $n = 1, 2, \ldots$. It will be shown that $\bigcup_{n=0}^\infty \mathcal{I}_n$ is
strictly smaller than $\mathcal{B} = \sigma(\mathcal{I}_0)$.


†See Problem 2.15.
*This topic may be omitted.

If $a_n$ and $b_n$ are rationals decreasing to $a$ and $b$, then $(a, b] = \bigcup_m \bigcap_n (a_m, b_n] =
\bigcup_m (\bigcup_n (a_m, b_n]^c)^c \in \mathcal{I}_4$. The result would therefore not be changed by including in
$\mathcal{I}_0$ all the intervals in $(0, 1]$.

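To prove $\bigcup_{n=0}^\infty \mathcal{I}_n$ smaller than $\mathcal{B}$, first put

(2.28) $\Psi(A_1, A_2, \ldots) = A_1^c \cup A_2 \cup A_3 \cup A_4 \cup \cdots$.

Since $\mathcal{I}_{n-1}$ contains $\Omega = (0, 1]$ and the empty set, every element of $\mathcal{I}_n$ has the form
(2.28) for some sequence $A_1, A_2, \ldots$ of sets in $\mathcal{I}_{n-1}$. Let every positive integer
appear exactly once in the square array

$m_{11} \quad m_{12} \quad \cdots$
$m_{21} \quad m_{22} \quad \cdots$
$\vdots$

Inductively define

(2.29) $\Phi_0(A_1, A_2, \ldots) = A_1$,
$\Phi_n(A_1, A_2, \ldots) = \Psi(\Phi_{n-1}(A_{m_{11}}, A_{m_{12}}, \ldots), \Phi_{n-1}(A_{m_{21}}, A_{m_{22}}, \ldots), \ldots)$, $n = 1, 2, \ldots$.

It follows by induction that every element of $\mathcal{I}_n$ has the form $\Phi_n(A_1, A_2, \ldots)$ for
some sequence of sets in $\mathcal{I}_0$. Finally, put

(2.30) $\Phi(A_1, A_2, \ldots) = \Phi_1(A_{m_{11}}, A_{m_{12}}, \ldots) \cup \Phi_2(A_{m_{21}}, A_{m_{22}}, \ldots) \cup \cdots$.

Then every element of $\bigcup_{n=0}^\infty \mathcal{I}_n$ has the form (2.30) for some sequence $A_1, A_2, \ldots$
of sets in $\mathcal{I}_0$.

If $A_1, A_2, \ldots$ are in $\mathcal{B}$, then (2.28) is in $\mathcal{B}$; it follows by induction that each
$\Phi_n(A_1, A_2, \ldots)$ is in $\mathcal{B}$ and therefore that (2.30) is in $\mathcal{B}$.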

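With each $\omega$ in $(0, 1]$ associate the sequence $(\omega_1, \omega_2, \ldots)$ of positive integers such
that $\omega_1 + \cdots + \omega_k$ is the position of the $k$th 1 in the nonterminating dyadic
expansion of $\omega$ (the smallest $n$ for which $\sum_{j=1}^n d_j(\omega) = k$). Then $\omega \leftrightarrow (\omega_1, \omega_2, \ldots)$ is a
one-to-one correspondence between $(0,1]$ and the set of all sequences of positive
integers. Let $I_1, I_2, \ldots$ be an enumeration of the sets in $\mathcal{I}_0$, put $\varphi(\omega) = \Phi(I_{\omega_1}, I_{\omega_2}, \ldots)$,
and define $B = [\omega: \omega \notin \varphi(\omega)]$. It will be shown that $B$ is a Borel set but is not
contained in any of the $\mathcal{I}_n$.

Since $\omega$ lies in $B$ if and only if $\omega$ lies outside $\varphi(\omega)$, $B \ne \varphi(\omega)$ for every $\omega$. But
every element of $\bigcup_{n=0}^\infty \mathcal{I}_n$ has the form (2.30) for some sequence in $\mathcal{I}_0$ and hence
has the form $\varphi(\omega)$ for some $\omega$. Therefore, $B$ is not a member of $\bigcup_{n=0}^\infty \mathcal{I}_n$.

It remains to show that $B$ is a Borel set. Let $D_k = [\omega: \omega \in I_{\omega_k}]$. Since $L_k(n) = [\omega:
\omega_1 + \cdots + \omega_k = n] = [\omega: \sum_{j=1}^{n-1} d_j(\omega) < k = \sum_{j=1}^{n} d_j(\omega)]$ is a Borel set, so are $[\omega: \omega_k =
n] = \bigcup_{m=1}^\infty L_{k-1}(m) \cap L_k(m + n)$ and

$D_k = [\omega: \omega \in I_{\omega_k}] = \bigcup_{n=1}^\infty ([\omega: \omega_k = n] \cap I_n)$.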
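Suppose that it is shown that

(2.31) $[\omega: \omega \in \Phi_n(I_{\omega_{u_1}}, I_{\omega_{u_2}}, \ldots)] = \Phi_n(D_{u_1}, D_{u_2}, \ldots)$

for every $n$ and every sequence $u_1, u_2, \ldots$ of positive integers. It will then follow from
the definition (2.30) that

$B^c = [\omega: \omega \in \varphi(\omega)] = \bigcup_{n=1}^\infty [\omega: \omega \in \Phi_n(I_{\omega_{m_{n1}}}, I_{\omega_{m_{n2}}}, \ldots)]$
$= \bigcup_{n=1}^\infty \Phi_n(D_{m_{n1}}, D_{m_{n2}}, \ldots) = \Phi(D_1, D_2, \ldots)$.

But as remarked above, (2.30) is a Borel set if the $A_n$ are. Therefore, (2.31) will imply
that $B^c$ and $B$ are Borel sets.

If $n = 0$, (2.31) holds because it reduces by (2.29) to $[\omega: \omega \in I_{\omega_{u_1}}] = D_{u_1}$. Suppose
that (2.31) holds with $n - 1$ in place of $n$. Consider the condition

(2.32) $\omega \in \Phi_{n-1}(I_{\omega_{u_{m_{k1}}}}, I_{\omega_{u_{m_{k2}}}}, \ldots)$.

By (2.28) and (2.29), a necessary and sufficient condition for $\omega \in \Phi_n(I_{\omega_{u_1}}, I_{\omega_{u_2}}, \ldots)$ is
that either (2.32) is false for $k = 1$ or else (2.32) is true for some $k$ exceeding 1. But by
the induction hypothesis, (2.32) and its negation can be replaced by $\omega \in
\Phi_{n-1}(D_{u_{m_{k1}}}, D_{u_{m_{k2}}}, \ldots)$ and its negation. Therefore, $\omega \in \Phi_n(I_{\omega_{u_1}}, I_{\omega_{u_2}}, \ldots)$ if and only if
$\omega \in \Phi_n(D_{u_1}, D_{u_2}, \ldots)$.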

Thus $\bigcup_n \mathcal{I}_n \ne \mathcal{B}$, and there are Borel sets that cannot be arrived at from the
intervals by any finite sequence of set-theoretic operations, each operation being finite 
or countable. It can even be shown that there are Borel sets that cannot be arrived at 
by any countable sequence of these operations. On the other hand, every Borel set 
can be arrived at by a countable ordered set of these operations if it is not required 
that they be performed in a simple sequence. The proof of this statement—and 
indeed even a precise explanation of its meaning—depends on the theory of infinite 
ordinal numbers.* 




PROBLEMS 


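2.1. Define $x \vee y = \max\{x, y\}$, and for a collection $\{x_\alpha\}$ define $\bigvee_\alpha x_\alpha = \sup_\alpha x_\alpha$;
define $x \wedge y = \min\{x, y\}$ and $\bigwedge_\alpha x_\alpha = \inf_\alpha x_\alpha$. Prove that $I_{A \cup B} = I_A \vee I_B$,
$I_{A \cap B} = I_A \wedge I_B$, $I_{A^c} = 1 - I_A$, and $I_{A \triangle B} = |I_A - I_B|$, in the sense that there is equality
at each point of $\Omega$. Show that $A \subset B$ if and only if $I_A \le I_B$ pointwise. Check
the equation $x \wedge (y \vee z) = (x \wedge y) \vee (x \wedge z)$ and deduce the distributive law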


See Problem 2.22. 




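$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$. By similar arguments prove that

$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$,
$A \triangle C \subset (A \triangle B) \cup (B \triangle C)$,
$(\bigcup_n A_n)^c = \bigcap_n A_n^c$,
$(\bigcap_n A_n)^c = \bigcup_n A_n^c$.


2.2. Let $A_1, \ldots, A_n$ be arbitrary events, and put $U_k = \bigcup (A_{i_1} \cap \cdots \cap A_{i_k})$ and
$I_k = \bigcap (A_{i_1} \cup \cdots \cup A_{i_k})$, where the union and intersection extend over all the
$k$-tuples satisfying $1 \le i_1 < \cdots < i_k \le n$. Show that $U_k = I_{n-k+1}$.


2.3. (a) Suppose that $\Omega \in \mathcal{F}$ and that $A, B \in \mathcal{F}$ implies $A - B = A \cap B^c \in \mathcal{F}$.
Show that $\mathcal{F}$ is a field.

(b) Suppose that $\Omega \in \mathcal{F}$ and that $\mathcal{F}$ is closed under the formation of complements
and finite disjoint unions. Show that $\mathcal{F}$ need not be a field.


2.4. Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be classes of sets in a common space $\Omega$.

(a) Suppose that $\mathcal{F}_n$ are fields satisfying $\mathcal{F}_n \subset \mathcal{F}_{n+1}$. Show that $\bigcup_{n=1}^\infty \mathcal{F}_n$ is a
field.

(b) Suppose that $\mathcal{F}_n$ are $\sigma$-fields satisfying $\mathcal{F}_n \subset \mathcal{F}_{n+1}$. Show by example that
$\bigcup_{n=1}^\infty \mathcal{F}_n$ need not be a $\sigma$-field.


2.5. The field $f(\mathcal{A})$ generated by a class $\mathcal{A}$ in $\Omega$ is defined as the intersection of all
fields in $\Omega$ containing $\mathcal{A}$.

(a) Show that $f(\mathcal{A})$ is indeed a field, that $\mathcal{A} \subset f(\mathcal{A})$, and that $f(\mathcal{A})$ is
minimal in the sense that if $\mathcal{G}$ is a field and $\mathcal{A} \subset \mathcal{G}$, then $f(\mathcal{A}) \subset \mathcal{G}$.

(b) Show that for nonempty $\mathcal{A}$, $f(\mathcal{A})$ is the class of sets of the form
$\bigcup_{i=1}^m \bigcap_{j=1}^{n_i} A_{ij}$, where for each $i$ and $j$ either $A_{ij} \in \mathcal{A}$ or $A_{ij}^c \in \mathcal{A}$, and where
the $m$ sets $\bigcap_{j=1}^{n_i} A_{ij}$, $1 \le i \le m$, are disjoint. The sets in $f(\mathcal{A})$ can thus be
explicitly presented, which is not in general true of the sets in $\sigma(\mathcal{A})$.


2.6. ↑ (a) Show that if $\mathcal{A}$ consists of the singletons, then $f(\mathcal{A})$ is the field in
Example 2.3.

(b) Show that $f(\mathcal{A}) \subset \sigma(\mathcal{A})$, that $f(\mathcal{A}) = \sigma(\mathcal{A})$ if $\mathcal{A}$ is finite, and that
$\sigma(f(\mathcal{A})) = \sigma(\mathcal{A})$.

(c) Show that if $\mathcal{A}$ is countable, then $f(\mathcal{A})$ is countable.

(d) Show for fields $\mathcal{F}_1$ and $\mathcal{F}_2$ that $f(\mathcal{F}_1 \cup \mathcal{F}_2)$ consists of the finite disjoint
unions of sets $A_1 \cap A_2$ with $A_i \in \mathcal{F}_i$. Extend.


2.7. 2.5 ↑ Let $H$ be a set lying outside $\mathcal{F}$, where $\mathcal{F}$ is a field [or $\sigma$-field]. Show
that the field [or $\sigma$-field] generated by $\mathcal{F} \cup \{H\}$ consists of sets of the form

(2.33) $(H \cap A) \cup (H^c \cap B)$, $A, B \in \mathcal{F}$.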
2.8. Suppose for each $A$ in $\mathcal{A}$ that $A^c$ is a countable union of elements of $\mathcal{A}$. The
class of intervals in $(0, 1]$ has this property. Show that $\sigma(\mathcal{A})$ coincides with the
smallest class over $\mathcal{A}$ that is closed under the formation of countable unions
and intersections.


2.9. Show that, if $B \in \sigma(\mathcal{A})$, then there exists a countable subclass $\mathcal{A}_B$ of $\mathcal{A}$ such
that $B \in \sigma(\mathcal{A}_B)$.


2.10. (a) Show that if $\sigma(\mathcal{A})$ contains every subset of $\Omega$, then for each pair $\omega$ and $\omega'$
of distinct points in $\Omega$ there is in $\mathcal{A}$ an $A$ such that $I_A(\omega) \ne I_A(\omega')$.

(b) Show that the reverse implication holds if $\Omega$ is countable.

(c) Show by example that the reverse implication need not hold for uncountable
$\Omega$.


2.11. A $\sigma$-field is countably generated, or separable, if it is generated by some
countable class of sets.

(a) Show that the $\sigma$-field $\mathcal{B}$ of Borel sets is countably generated.

(b) Show that the $\sigma$-field of Example 2.4 is countably generated if and only if $\Omega$
is countable.

(c) Suppose that $\mathcal{F}_1$ and $\mathcal{F}_2$ are $\sigma$-fields, $\mathcal{F}_1 \subset \mathcal{F}_2$, and $\mathcal{F}_2$ is countably
generated. Show by example that $\mathcal{F}_1$ may not be countably generated.


2.12. Show that a $\sigma$-field cannot be countably infinite: its cardinality must be finite
or else at least that of the continuum. Show by example that a field can be
countably infinite.


2.13. (a) Let $\mathcal{F}$ be the field consisting of the finite and the cofinite sets in an infinite
$\Omega$, and define $P$ on $\mathcal{F}$ by taking $P(A)$ to be 0 or 1 as $A$ is finite or cofinite.
(Note that $P$ is not well defined if $\Omega$ is finite.) Show that $P$ is finitely additive.

(b) Show that this $P$ is not countably additive if $\Omega$ is countably infinite.

(c) Show that this $P$ is countably additive if $\Omega$ is uncountable.

(d) Now let $\mathcal{F}$ be the $\sigma$-field consisting of the countable and the cocountable
sets in an uncountable $\Omega$, and define $P$ on $\mathcal{F}$ by taking $P(A)$ to be 0 or 1 as $A$
is countable or cocountable. (Note that $P$ is not well defined if $\Omega$ is countable.)
Show that $P$ is countably additive.


2.14. In $(0, 1]$ let $\mathcal{F}$ be the class of sets that either (i) are of the first category [A15] or
(ii) have complement of the first category. Show that $\mathcal{F}$ is a $\sigma$-field. For $A$ in
$\mathcal{F}$, take $P(A)$ to be 0 in case (i) and 1 in case (ii). Show that $P$ is countably
additive.


2.15. On the field $\mathcal{B}_0$ in $(0, 1]$ define $P(A)$ to be 1 or 0 according as there does or
does not exist some positive $\epsilon_A$ (depending on $A$) such that $A$ contains the
interval $(\frac{1}{2}, \frac{1}{2} + \epsilon_A]$. Show that $P$ is finitely but not countably additive. No such
example is possible for the field $\mathcal{C}_0$ in $S^\infty$ (Theorem 2.3).


2.16. (a) Suppose that $P$ is a probability measure on a field $\mathcal{F}$. Suppose that $A_t \in \mathcal{F}$
for $t > 0$, that $A_s \subset A_t$ for $s < t$, and that $A = \bigcup_{t>0} A_t \in \mathcal{F}$. Extend Theorem
2.1(i) by showing that $P(A_t) \uparrow P(A)$ as $t \to \infty$. Show that $A$ necessarily lies in $\mathcal{F}$
if it is a $\sigma$-field.

(b) Extend Theorem 2.1(ii) in the same way.
2.17. Suppose that $P$ is a probability measure on a field $\mathcal{F}$, that $A_1, A_2, \ldots$, and
$A = \bigcup_n A_n$ lie in $\mathcal{F}$, and that the $A_n$ are nearly disjoint in the sense that
$P(A_m \cap A_n) = 0$ for $m \ne n$. Show that $P(A) = \sum_n P(A_n)$.


2.18. Stochastic arithmetic. Define a set function $P_n$ on the class of all subsets of
$\Omega = \{1, 2, \ldots\}$ by

(2.34) $P_n(A) = \frac{1}{n} \#[m: 1 \le m \le n, m \in A]$;

among the first $n$ integers, the proportion that lie in $A$ is just $P_n(A)$. Then $P_n$ is
a discrete probability measure. The set $A$ has density

(2.35) $D(A) = \lim_n P_n(A)$,

provided this limit exists. Let $\mathcal{D}$ be the class of sets having density.

(a) Show that $D$ is finitely but not countably additive on $\mathcal{D}$.

(b) Show that $\mathcal{D}$ contains the empty set and $\Omega$ and is closed under the
formation of complements, proper differences, and finite disjoint unions, but is
not closed under the formation of countable disjoint unions or of finite unions
that are not disjoint.

(c) Let $\mathcal{M}$ consist of the periodic sets $M_a = [ka: k = 1, 2, \ldots]$. Observe that

(2.36) $P_n(M_a) = \frac{1}{n} \lfloor \frac{n}{a} \rfloor \to \frac{1}{a} = D(M_a)$.

Show that the field $f(\mathcal{M})$ generated by $\mathcal{M}$ (see Problem 2.5) is contained in $\mathcal{D}$.
Show that $D$ is completely determined on $f(\mathcal{M})$ by the value it gives for each $a$
to the event that $m$ is divisible by $a$.

(d) Assume that $\sum p^{-1}$ diverges (sum over all primes; see Problem 5.20(e)) and
prove that $D$, although finitely additive, is not countably additive on the field
$f(\mathcal{M})$.

(e) Euler's function $\varphi(n)$ is the number of positive integers less than $n$ and
relatively prime to it. Let $p_1, \ldots, p_r$ be the distinct prime factors of $n$; from the
inclusion-exclusion formula for the events $[m: p_i \mid m]$, (2.36), and the fact that
the $p_i$ divide $n$, deduce

(2.37) $\frac{\varphi(n)}{n} = \prod_{p \mid n} (1 - \frac{1}{p})$.

(f) Show for $0 \le x \le 1$ that $D(A) = x$ for some $A$.

(g) Show that $D$ is translation invariant: If $B = [m + 1: m \in A]$, then $B$ has a
density if and only if $A$ does, in which case $D(A) = D(B)$.
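
A numerical check can make (2.36) and (2.37) concrete before they are proved. The Python sketch below is illustrative only; the helper names `P_n`, `euler_phi`, and `prime_factors` are assumptions introduced for this example, and the check is over a small range rather than a proof.

```python
from fractions import Fraction
from math import gcd

def P_n(n, a):
    """P_n(M_a) as in (2.34): proportion of m in 1..n that are divisible by a."""
    return Fraction(sum(1 for m in range(1, n + 1) if m % a == 0), n)

def euler_phi(n):
    """Number of positive integers less than n and relatively prime to n (n >= 2)."""
    return sum(1 for m in range(1, n) if gcd(m, n) == 1)

def prime_factors(n):
    """Distinct prime factors of n by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

# (2.36): P_n(M_a) equals floor(n/a)/n, which approaches 1/a as n grows.
for n in (10, 100, 1000, 10000):
    print(n, P_n(n, 7), float(P_n(n, 7)))

# (2.37): phi(n)/n equals the product of (1 - 1/p) over the primes p dividing n.
for n in range(2, 200):
    prod = Fraction(1)
    for q in prime_factors(n):
        prod *= 1 - Fraction(1, q)
    assert Fraction(euler_phi(n), n) == prod
```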


2.19. A probability measure space $(\Omega, \mathcal{F}, P)$ is nonatomic if $P(A) > 0$ implies that
there exists a $B$ such that $B \subset A$ and $0 < P(B) < P(A)$ ($A$ and $B$ in $\mathcal{F}$, of
course).

(a) Assuming the existence of Lebesgue measure $\lambda$ on $\mathcal{B}$, prove that it is
nonatomic.

(b) Show in the nonatomic case that $P(A) > 0$ and $\epsilon > 0$ imply that there exists
a $B$ such that $B \subset A$ and $0 < P(B) < \epsilon$.
(c) Show in the nonatomic case that $0 \le x \le P(A)$ implies that there exists a $B$
such that $B \subset A$ and $P(B) = x$. Hint: Inductively define classes $\mathcal{H}_n$, numbers
$h_n$, and sets $H_n$ by $\mathcal{H}_0 = \{\emptyset\} = \{H_0\}$, $\mathcal{H}_n = [H: H \subset A - \bigcup_{k<n} H_k,
P(\bigcup_{k<n} H_k) + P(H) \le x]$, $h_n = \sup[P(H): H \in \mathcal{H}_n]$, and $P(H_n) > h_n - n^{-1}$.
Consider $\bigcup_k H_k$.

(d) Show in the nonatomic case that, if $p_1, p_2, \ldots$ are nonnegative and add to 1,
then $A$ can be decomposed into sets $B_1, B_2, \ldots$ such that $P(B_n) = p_n P(A)$.


2.20. Generalize the construction of product measure: For $n = 1, 2, \ldots$, let $S_n$ be a
finite space with given probabilities $p_{nu}$, $u \in S_n$. Let $S_1 \times S_2 \times \cdots$ be the space
of sequences (2.15), where now $z_k(\omega) \in S_k$. Define $P$ on the class of cylinders,
appropriately defined, by using the product $p_{1u_1} \cdots p_{nu_n}$ on the right in (2.21).
Prove $P$ countably additive on $\mathcal{C}_0$, and extend Theorem 2.3 and its lemma to
this more general setting. Show that the lemma fails if any of the $S_n$ are infinite.


2.21. (a) Suppose that $\mathcal{A} = \{A_1, A_2, \ldots\}$ is a countable partition of $\Omega$. Show (see
(2.27)) that $\mathcal{A}_1 = \mathcal{A}_0^* = \mathcal{A}^*$ coincides with $\sigma(\mathcal{A})$. This is a case where $\sigma(\mathcal{A})$
can be constructed "from the inside."

(b) Show that the set of normal numbers lies in $\mathcal{I}_6$.

(c) Show that $\mathcal{H}^* = \mathcal{H}$ if and only if $\mathcal{H}$ is a $\sigma$-field. Show that $\mathcal{I}_{n-1}$ is
strictly smaller than $\mathcal{I}_n$ for all $n$.


2.22. Extend (2.27) to infinite ordinals $\alpha$ by defining $\mathcal{A}_\alpha = (\bigcup_{\beta < \alpha} \mathcal{A}_\beta)^*$. Show that, if
$\Omega$ is the first uncountable ordinal, then $\bigcup_{\alpha < \Omega} \mathcal{A}_\alpha = \sigma(\mathcal{A})$. Show that, if the
cardinality of $\mathcal{A}$ does not exceed that of the continuum, then the same is true
of $\sigma(\mathcal{A})$. Thus $\mathcal{B}$ has the power of the continuum.


2.23. ↑ Extend (2.29) to ordinals $\alpha < \Omega$ as follows. Replace the right side of (2.28)
by $\bigcup_{n=1}^\infty (A_{2n-1} \cup A_{2n}^c)$. Suppose that $\Phi_\beta$ is defined for $\beta < \alpha$. Let
$\beta_\alpha(1), \beta_\alpha(2), \ldots$ be a sequence of ordinals such that $\beta_\alpha(n) < \alpha$ and such that if
$\beta < \alpha$, then $\beta = \beta_\alpha(n)$ for infinitely many even $n$ and for infinitely many odd $n$;
define

(2.38) $\Phi_\alpha(A_1, A_2, \ldots) = \Psi(\Phi_{\beta_\alpha(1)}(A_{m_{11}}, A_{m_{12}}, \ldots), \Phi_{\beta_\alpha(2)}(A_{m_{21}}, A_{m_{22}}, \ldots), \ldots)$.

Prove by transfinite induction that (2.38) is in $\mathcal{B}$ if the $A_n$ are, that every
element of $\mathcal{I}_\alpha$ has the form (2.38) for sets $A_n$ in $\mathcal{I}_0$, and that (2.31) holds with
$\alpha$ in place of $n$. Define $\varphi_\alpha(\omega) = \Phi_\alpha(I_{\omega_1}, I_{\omega_2}, \ldots)$, and show that $B_\alpha = [\omega:
\omega \notin \varphi_\alpha(\omega)]$ lies in $\mathcal{B} - \mathcal{I}_\alpha$ for $\alpha < \Omega$. Show that $\mathcal{I}_\alpha$ is strictly smaller than $\mathcal{I}_\beta$
for $\alpha < \beta \le \Omega$.


SECTION 3. EXISTENCE AND EXTENSION 


The main theorem to be proved here may be compactly stated this way: 


Theorem 3.1. A probability measure on a field has a unique extension to
the generated $\sigma$-field.


In more detail the assertion is this: Suppose that $P$ is a probability
measure on a field $\mathcal{F}_0$ of subsets of $\Omega$, and put $\mathcal{F} = \sigma(\mathcal{F}_0)$. Then there
exists a probability measure $Q$ on $\mathcal{F}$ such that $Q(A) = P(A)$ for $A \in \mathcal{F}_0$.
Further, if $Q'$ is another probability measure on $\mathcal{F}$ such that $Q'(A) = P(A)$
for $A \in \mathcal{F}_0$, then $Q'(A) = Q(A)$ for $A \in \mathcal{F}$.

Although the measure extended to $\mathcal{F}$ is usually denoted by the same
letter as the original measure on $\mathcal{F}_0$, they are really different set functions,
since they have different domains of definition. The class $\mathcal{F}_0$ is only assumed
finitely additive in the theorem, but the set function $P$ on it must be assumed
countably additive (since this of course follows from the conclusion of the
theorem).

As shown in Theorem 2.2, $\lambda$ (initially defined for intervals as length:
$\lambda(I) = |I|$) extends to a probability measure on the field $\mathcal{B}_0$ of finite disjoint
unions of subintervals of $(0, 1]$. By Theorem 3.1, $\lambda$ extends in a unique way
from $\mathcal{B}_0$ to $\mathcal{B} = \sigma(\mathcal{B}_0)$, the class of Borel sets in $(0, 1]$. The extended $\lambda$ is
Lebesgue measure on the unit interval. Theorem 3.1 has many other applications
as well.

The uniqueness in Theorem 3.1 will be proved later; see Theorem 3.3. The 
first project is to prove that an extension does exist. 


Construction of the Extension 


Let $P$ be a probability measure on a field $\mathcal{F}_0$. The construction following
extends $P$ to a class that in general is much larger than $\sigma(\mathcal{F}_0)$ but nonetheless
does not in general contain all the subsets of $\Omega$.

For each subset $A$ of $\Omega$, define its outer measure by

(3.1) $P^*(A) = \inf \sum_n P(A_n)$,

where the infimum extends over all finite and infinite sequences $A_1, A_2, \ldots$
of $\mathcal{F}_0$-sets satisfying $A \subset \bigcup_n A_n$. If the $A_n$ form an efficient covering of $A$,
in the sense that they do not overlap one another very much or extend much
beyond $A$, then $\sum_n P(A_n)$ should be a good outer approximation to the
measure of $A$ if $A$ is indeed to have a measure assigned it at all. Thus (3.1)
represents a first attempt to assign a measure to $A$.

Because of the rule $P(A^c) = 1 - P(A)$ for complements (see (2.6)), it is
natural in approximating $A$ from the inside to approximate the complement
$A^c$ from the outside instead and then subtract from 1:

(3.2) $P_*(A) = 1 - P^*(A^c)$.

This, the inner measure of $A$, is a second candidate for the measure of $A$.† A
plausible procedure is to assign measure to those $A$ for which (3.1) and (3.2)
agree, and to take the common value $P^*(A) = P_*(A)$ as the measure. Since
(3.1) and (3.2) agree if and only if

(3.3) $P^*(A) + P^*(A^c) = 1$,

the procedure would be to consider the class of $A$ satisfying (3.3) and use
$P^*(A)$ as the measure.

†An idea which seems reasonable at first is to define $P_*(A)$ as the supremum of the sums
$\sum_n P(A_n)$ for disjoint sequences of $\mathcal{F}_0$-sets in $A$. This will not do. For example, in the case
where $\Omega$ is the unit interval, $\mathcal{F}_0$ is $\mathcal{B}_0$ (Example 2.2), and $P$ is $\lambda$ as defined by (2.12), the set $N$
of normal numbers would have inner measure 0 because it contains no nonempty elements of
$\mathcal{B}_0$; in a satisfactory theory, $N$ will have both inner and outer measure 1.

It turns out to be simpler to impose on $A$ the more stringent requirement
that

(3.4) $P^*(A \cap E) + P^*(A^c \cap E) = P^*(E)$

hold for every set $E$; (3.3) is the special case $E = \Omega$, because it will turn out
that $P^*(\Omega) = 1$.‡ A set $A$ is called $P^*$-measurable if (3.4) holds for all $E$; let
$\mathcal{M}$ be the class of such sets. What will be shown is that $\mathcal{M}$ contains $\sigma(\mathcal{F}_0)$
and that the restriction of $P^*$ to $\sigma(\mathcal{F}_0)$ is the required extension of $P$.
The set function $P^*$ has four properties that will be needed:


(i) $P^*(\emptyset) = 0$;
(ii) $P^*$ is nonnegative: $P^*(A) \ge 0$ for every $A \subset \Omega$;
(iii) $P^*$ is monotone: $A \subset B$ implies $P^*(A) \le P^*(B)$;
(iv) $P^*$ is countably subadditive: $P^*(\bigcup_n A_n) \le \sum_n P^*(A_n)$.


The others being obvious, only (iv) needs proof. For a given $\epsilon$, choose
$\mathcal{F}_0$-sets $B_{nk}$ such that $A_n \subset \bigcup_k B_{nk}$ and $\sum_k P(B_{nk}) < P^*(A_n) + \epsilon 2^{-n}$,
which is possible by the definition (3.1). Now $\bigcup_n A_n \subset \bigcup_{n,k} B_{nk}$, so that
$P^*(\bigcup_n A_n) \le \sum_{n,k} P(B_{nk}) < \sum_n P^*(A_n) + \epsilon$, and (iv) follows.§ Of course, (iv)
implies finite subadditivity.

By definition, $A$ lies in the class $\mathcal{M}$ of $P^*$-measurable sets if it splits each
$E$ in $2^\Omega$ in such a way that $P^*$ adds for the pieces, that is, if (3.4) holds.
Because of finite subadditivity, this is equivalent to

(3.5) $P^*(A \cap E) + P^*(A^c \cap E) \le P^*(E)$.
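
On a finite space the outer measure (3.1) and the measurability condition (3.4) can be evaluated by brute force. The Python sketch below is an illustration only: the four-point space, the field generated by the partition {1,2}, {3}, {4}, and the helper names `powerset`, `outer`, and `measurable` are all assumptions made for the example. In this particular case the $P^*$-measurable sets turn out to be exactly the sets of the field, and $P^*$ agrees with $P$ on them.

```python
from fractions import Fraction
from itertools import combinations

def powerset(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

omega = frozenset({1, 2, 3, 4})
atoms = {frozenset({1, 2}): Fraction(1, 2),
         frozenset({3}): Fraction(1, 4),
         frozenset({4}): Fraction(1, 4)}

# The field F_0: all unions of atoms, with P additive over the atoms.
field = {}
for family in powerset(atoms):
    A = frozenset().union(*family)
    field[A] = sum((atoms[a] for a in family), Fraction(0))

def outer(A):
    """P*(A) as in (3.1), minimizing over every finite cover of A by F_0-sets."""
    best = None
    for cover in powerset(field):
        if A <= frozenset().union(*cover):
            cost = sum((field[B] for B in cover), Fraction(0))
            best = cost if best is None else min(best, cost)
    return best

P_star = {A: outer(A) for A in powerset(omega)}

def measurable(A):
    """Condition (3.4): P*(A & E) + P*(A^c & E) = P*(E) for every E."""
    return all(P_star[A & E] + P_star[E - A] == P_star[E] for E in P_star)

M = {A for A in P_star if measurable(A)}
print(len(M), M == set(field))
print(P_star[frozenset({1})], 1 - P_star[omega - frozenset({1})])  # outer, inner
```

The set {1} splits the atom {1,2}, so its outer and inner measures disagree and it fails (3.4) with E = {1,2}.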


Lemma 1. The class $\mathcal{M}$ is a field.


‡It also turns out, after the fact, that (3.3) implies that (3.4) holds for all $E$ anyway; see Problem
3.2.
§Compare the proof on p. 9 that a countable union of negligible sets is negligible.

PROOF. It is clear that $\Omega \in \mathcal{M}$ and that $\mathcal{M}$ is closed under complementation.
Suppose that $A, B \in \mathcal{M}$ and $E \subset \Omega$. Then

$P^*(E) = P^*(B \cap E) + P^*(B^c \cap E)$
$= P^*(A \cap B \cap E) + P^*(A^c \cap B \cap E) + P^*(A \cap B^c \cap E) + P^*(A^c \cap B^c \cap E)$
$\ge P^*(A \cap B \cap E) + P^*((A^c \cap B \cap E) \cup (A \cap B^c \cap E) \cup (A^c \cap B^c \cap E))$
$= P^*((A \cap B) \cap E) + P^*((A \cap B)^c \cap E)$,

the inequality following by subadditivity. Hence† $A \cap B \in \mathcal{M}$, and $\mathcal{M}$ is a
field. ∎


Lemma 2. If $A_1, A_2, \ldots$ is a finite or infinite sequence of disjoint $\mathcal{M}$-sets,
then for each $E \subset \Omega$,

(3.6) $P^*(E \cap (\bigcup_k A_k)) = \sum_k P^*(E \cap A_k)$.


PROOF. Consider first the case of finitely many $A_k$, say $n$ of them. For
$n = 1$, there is nothing to prove. In the case $n = 2$, if $A_1 \cup A_2 = \Omega$, then (3.6)
is just (3.4) with $A_1$ (or $A_2$) in the role of $A$. If $A_1 \cup A_2$ is smaller than $\Omega$,
split $E \cap (A_1 \cup A_2)$ by $A_1$ and $A_1^c$ (or by $A_2$ and $A_2^c$) and use (3.4) and
disjointness.

Assume (3.6) holds for the case of $n - 1$ sets. By the case $n = 2$, together
with the induction hypothesis, $P^*(E \cap (\bigcup_{k=1}^n A_k)) = P^*(E \cap (\bigcup_{k=1}^{n-1} A_k)) +
P^*(E \cap A_n) = \sum_{k=1}^n P^*(E \cap A_k)$.

Thus (3.6) holds in the finite case. For the infinite case use monotonicity:
$P^*(E \cap (\bigcup_{k=1}^\infty A_k)) \ge P^*(E \cap (\bigcup_{k=1}^n A_k)) = \sum_{k=1}^n P^*(E \cap A_k)$. Let $n \to \infty$,
and conclude that the left side of (3.6) is greater than or equal to the right.
The reverse inequality follows by countable subadditivity. ∎


Lemma 3. The class $\mathcal{M}$ is a $\sigma$-field, and $P^*$ restricted to $\mathcal{M}$ is countably
additive.


PROOF. Suppose that $A_1, A_2, \ldots$ are disjoint $\mathcal{M}$-sets with union $A$. Since
$F_n = \bigcup_{k=1}^n A_k$ lies in the field $\mathcal{M}$, $P^*(E) = P^*(E \cap F_n) + P^*(E \cap F_n^c)$. To the
first term on the right apply (3.6), and to the second term apply monotonicity
($F_n^c \supset A^c$): $P^*(E) \ge \sum_{k=1}^n P^*(E \cap A_k) + P^*(E \cap A^c)$. Let $n \to \infty$ and use (3.6)
again: $P^*(E) \ge \sum_{k=1}^\infty P^*(E \cap A_k) + P^*(E \cap A^c) = P^*(E \cap A) + P^*(E \cap A^c)$.
Hence $A$ satisfies (3.5) and so lies in $\mathcal{M}$, which is therefore closed under the
formation of countable disjoint unions.

†This proof does not work if (3.4) is weakened to (3.3).

From the fact that $\mathcal{M}$ is a field closed under the formation of countable
disjoint unions it follows that $\mathcal{M}$ is a $\sigma$-field (for sets $B_k$ in $\mathcal{M}$, let $A_1 = B_1$
and $A_k = B_k \cap B_1^c \cap \cdots \cap B_{k-1}^c$; then the $A_k$ are disjoint $\mathcal{M}$-sets and
$\bigcup_k B_k = \bigcup_k A_k \in \mathcal{M}$). The countable additivity of $P^*$ on $\mathcal{M}$ follows from
(3.6): take $E = \Omega$. ∎


Lemmas 1, 2, and 3 use only the properties (i) through (iv) of $P^*$ derived
above. The next two use the specific assumption that $P^*$ is defined via (3.1)
from a probability measure $P$ on the field $\mathcal{F}_0$.


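Lemma 4. If $P^*$ is defined by (3.1), then $\mathcal{F}_0 \subset \mathcal{M}$.


PROOF. Suppose that $A \in \mathcal{F}_0$. Given $E$ and $\epsilon$, choose $\mathcal{F}_0$-sets $A_n$ such
that $E \subset \bigcup_n A_n$ and $\sum_n P(A_n) \le P^*(E) + \epsilon$. The sets $B_n = A_n \cap A$ and $C_n =
A_n \cap A^c$ lie in $\mathcal{F}_0$ because it is a field. Also, $E \cap A \subset \bigcup_n B_n$ and $E \cap A^c \subset
\bigcup_n C_n$; by the definition of $P^*$ and the finite additivity of $P$, $P^*(E \cap A) +
P^*(E \cap A^c) \le \sum_n P(B_n) + \sum_n P(C_n) = \sum_n P(A_n) \le P^*(E) + \epsilon$. Hence $A \in \mathcal{F}_0$
implies (3.5), and so $\mathcal{F}_0 \subset \mathcal{M}$. ∎


Lemma 5. If $P^*$ is defined by (3.1), then

(3.7) $P^*(A) = P(A)$ for $A \in \mathcal{F}_0$.


PROOF. It is obvious from the definition (3.1) that $P^*(A) \le P(A)$ for $A$
in $\mathcal{F}_0$. If $A \subset \bigcup_n A_n$, where $A$ and the $A_n$ are in $\mathcal{F}_0$, then by the countable
subadditivity and monotonicity of $P$ on $\mathcal{F}_0$, $P(A) \le \sum_n P(A \cap A_n) \le
\sum_n P(A_n)$. Hence (3.7). ∎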


PROOF OF EXTENSION IN THEOREM 3.1. Suppose that $P^*$ is defined via
(3.1) from a (countably additive) probability measure $P$ on the field $\mathcal{F}_0$. Let
$\mathcal{F} = \sigma(\mathcal{F}_0)$. By Lemmas 3 and 4,†

$\mathcal{F}_0 \subset \mathcal{F} \subset \mathcal{M} \subset 2^\Omega$.

By (3.7), $P^*(\Omega) = P(\Omega) = 1$. By Lemma 3, $P^*$ (which is defined on all of $2^\Omega$)
restricted to $\mathcal{M}$ is therefore a probability measure there. And then $P^*$
further restricted to $\mathcal{F}$ is clearly a probability measure on that class as well.
This measure on $\mathcal{F}$ is the required extension, because by (3.7) it agrees with
$P$ on $\mathcal{F}_0$. ∎

†In the case of Lebesgue measure, the relation is $\mathcal{B}_0 \subset \mathcal{B} \subset \mathcal{M} \subset 2^{(0,1]}$, and each of the three
inclusions is strict; see Example 2.2 and Problems 3.14 and 3.21.


Uniqueness and the $\pi$-$\lambda$ Theorem


To prove the extension in Theorem 3.1 is unique requires some auxiliary
concepts. A class $\mathcal{P}$ of subsets of $\Omega$ is a $\pi$-system if it is closed under the
formation of finite intersections:

($\pi$) $A, B \in \mathcal{P}$ implies $A \cap B \in \mathcal{P}$.

A class $\mathcal{L}$ is a $\lambda$-system if it contains $\Omega$ and is closed under the formation of
complements and of finite and countable disjoint unions:

($\lambda_1$) $\Omega \in \mathcal{L}$;
($\lambda_2$) $A \in \mathcal{L}$ implies $A^c \in \mathcal{L}$;
($\lambda_3$) $A_1, A_2, \ldots \in \mathcal{L}$ and $A_n \cap A_m = \emptyset$ for $m \ne n$ imply $\bigcup_n A_n \in \mathcal{L}$.

Because of the disjointness condition in ($\lambda_3$), the definition of $\lambda$-system is
weaker (more inclusive) than that of $\sigma$-field. In the presence of ($\lambda_1$) and ($\lambda_2$),
which imply $\emptyset \in \mathcal{L}$, the countably infinite case of ($\lambda_3$) implies the finite one.

In the presence of ($\lambda_1$) and ($\lambda_3$), ($\lambda_2$) is equivalent to the condition that $\mathcal{L}$
is closed under the formation of proper differences:

($\lambda_2'$) $A, B \in \mathcal{L}$ and $A \subset B$ imply $B - A \in \mathcal{L}$.

Suppose, in fact, that $\mathcal{L}$ satisfies ($\lambda_2$) and ($\lambda_3$). If $A, B \in \mathcal{L}$ and $A \subset B$,
then $\mathcal{L}$ contains $B^c$, the disjoint union $A \cup B^c$, and its complement
$(A \cup B^c)^c = B - A$. Hence ($\lambda_2'$). On the other hand, if $\mathcal{L}$ satisfies ($\lambda_1$) and ($\lambda_2'$),
then $A \in \mathcal{L}$ implies $A^c = \Omega - A \in \mathcal{L}$. Hence ($\lambda_2$).

Although a $\sigma$-field is a $\lambda$-system, the reverse is not true (in a four-point
space take $\mathcal{L}$ to consist of $\emptyset$, $\Omega$, and the six two-point sets). But the
connection is close:


Lemma 6. A class that is both a $\pi$-system and a $\lambda$-system is a $\sigma$-field.


PROOF. The class contains $\Omega$ by ($\lambda_1$) and is closed under the formation
of complements and finite intersections by ($\lambda_2$) and ($\pi$). It is therefore a
field. It is a $\sigma$-field because if it contains sets $A_n$, then it also contains the
disjoint sets $B_n = A_n \cap A_1^c \cap \cdots \cap A_{n-1}^c$ and by ($\lambda_3$) contains $\bigcup_n A_n = \bigcup_n B_n$. ∎
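
The four-point example mentioned before Lemma 6 can be verified mechanically. The Python sketch below is illustrative; the function names `is_lambda_system` and `is_pi_system` are assumptions introduced here. It relies on the fact that in a finite space it is enough to test pairwise disjoint unions, since a disjoint sequence has only finitely many nonempty members and these can be joined one at a time.

```python
from itertools import combinations

omega = frozenset({1, 2, 3, 4})
# The class consisting of the empty set, omega, and the six two-point sets.
L = {frozenset(), omega} | {frozenset(c) for c in combinations(omega, 2)}

def is_lambda_system(cls, omega):
    """Check (lambda_1), (lambda_2), and (lambda_3) on a finite space."""
    if omega not in cls:
        return False
    if any(omega - A not in cls for A in cls):
        return False
    return all(A | B in cls for A in cls for B in cls if not (A & B))

def is_pi_system(cls):
    """Check closure under pairwise (hence finite) intersections."""
    return all(A & B in cls for A in cls for B in cls)

print(is_lambda_system(L, omega))  # expected: True
print(is_pi_system(L))             # expected: False, e.g. {1,2} & {1,3} = {1}
```

Since the class is a $\lambda$-system but not a $\pi$-system, Lemma 6 does not apply to it, and indeed it is not a field.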
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Many uniqueness arguments depend on Dynkin’s 7-A theorem: 


Theorem 3.2. If PY is a m-system and Z is a A-system, then Poy 
implies o( P) CL. 


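PROOF. Let $\mathcal{L}_0$ be the $\lambda$-system generated by $\mathcal{P}$, that is, the
intersection of all $\lambda$-systems containing $\mathcal{P}$. It is a $\lambda$-system, it
contains $\mathcal{P}$, and it is contained in every $\lambda$-system that contains $\mathcal{P}$
(see the construction of generated $\sigma$-fields, p. 21). Thus
$\mathcal{P} \subset \mathcal{L}_0 \subset \mathcal{L}$. If it can be shown that $\mathcal{L}_0$
is also a $\pi$-system, then it will follow by Lemma 6 that it is a $\sigma$-field. From the
minimality of $\sigma(\mathcal{P})$ it will then follow that
$\sigma(\mathcal{P}) \subset \mathcal{L}_0$, so that
$\mathcal{P} \subset \sigma(\mathcal{P}) \subset \mathcal{L}_0 \subset \mathcal{L}$. Therefore,
it suffices to show that $\mathcal{L}_0$ is a $\pi$-system.

For each $A$, let $\mathcal{L}_A$ be the class of sets $B$ such that
$A \cap B \in \mathcal{L}_0$. If $A$ is assumed to lie in $\mathcal{P}$, or even if $A$ is merely
assumed to lie in $\mathcal{L}_0$, then $\mathcal{L}_A$ is a $\lambda$-system: Since
$A \cap \Omega = A \in \mathcal{L}_0$ by the assumption, $\mathcal{L}_A$ satisfies $(\lambda_1)$.
If $B_1, B_2 \in \mathcal{L}_A$ and $B_1 \subset B_2$, then the $\lambda$-system $\mathcal{L}_0$
contains $A \cap B_1$ and $A \cap B_2$ and hence contains the proper difference
$(A \cap B_2) - (A \cap B_1) = A \cap (B_2 - B_1)$, so that $\mathcal{L}_A$ contains
$B_2 - B_1$: $\mathcal{L}_A$ satisfies $(\lambda_2')$. If $B_n$ are disjoint
$\mathcal{L}_A$-sets, then $\mathcal{L}_0$ contains the disjoint sets $A \cap B_n$ and hence
contains their union $A \cap (\bigcup_n B_n)$: $\mathcal{L}_A$ satisfies $(\lambda_3)$.

If $A \in \mathcal{P}$ and $B \in \mathcal{P}$, then ($\mathcal{P}$ is a $\pi$-system)
$A \cap B \in \mathcal{P} \subset \mathcal{L}_0$, or $B \in \mathcal{L}_A$. Thus
$A \in \mathcal{P}$ implies $\mathcal{P} \subset \mathcal{L}_A$, and since $\mathcal{L}_A$ is a
$\lambda$-system, minimality gives $\mathcal{L}_0 \subset \mathcal{L}_A$.

Thus $A \in \mathcal{P}$ implies $\mathcal{L}_0 \subset \mathcal{L}_A$, or, to put it another
way, $A \in \mathcal{P}$ and $B \in \mathcal{L}_0$ together imply that $B \in \mathcal{L}_A$ and
hence $A \in \mathcal{L}_B$. (The key to the proof is that $B \in \mathcal{L}_A$ if and only if
$A \in \mathcal{L}_B$.) This last implication means that $B \in \mathcal{L}_0$ implies
$\mathcal{P} \subset \mathcal{L}_B$. Since $\mathcal{L}_B$ is a $\lambda$-system, it follows by
minimality once again that $B \in \mathcal{L}_0$ implies $\mathcal{L}_0 \subset \mathcal{L}_B$.
Finally, $B \in \mathcal{L}_0$ and $C \in \mathcal{L}_0$ together imply $C \in \mathcal{L}_B$, or
$B \cap C \in \mathcal{L}_0$. Therefore, $\mathcal{L}_0$ is indeed a $\pi$-system. ∎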
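On a finite space the theorem can be checked by brute force. The Python sketch below is an
illustration only and not part of the proof: the four-point space `OMEGA`, the helper names
`lambda_closure` and `sigma_closure`, and the two sample classes `P` and `Q` are assumptions
made here. Because the space is finite, countable disjoint unions reduce to pairwise ones.

```python
OMEGA = frozenset(range(4))  # hypothetical four-point space


def is_pi_system(cls):
    """True if cls is closed under pairwise intersection."""
    return all((a & b) in cls for a in cls for b in cls)


def _close(cls, need_disjoint):
    """Close cls (plus OMEGA) under complements and pairwise unions.

    With need_disjoint=True only unions of disjoint sets are formed, which on a
    finite space yields the generated lambda-system; otherwise the sigma-field.
    """
    out = set(cls) | {OMEGA}
    changed = True
    while changed:
        changed = False
        for a in list(out):
            for b in list(out):
                candidates = [OMEGA - a]
                if not need_disjoint or not (a & b):
                    candidates.append(a | b)
                for s in candidates:
                    if s not in out:
                        out.add(s)
                        changed = True
    return out


def lambda_closure(cls):
    return _close(cls, need_disjoint=True)


def sigma_closure(cls):
    return _close(cls, need_disjoint=False)


P = {frozenset({0, 1}), frozenset({1, 2}), frozenset({1})}  # a pi-system
Q = {frozenset({0, 1}), frozenset({1, 2})}                  # not a pi-system

print(is_pi_system(P), lambda_closure(P) == sigma_closure(P))          # True True
print(is_pi_system(Q), len(lambda_closure(Q)), len(sigma_closure(Q)))  # False 6 16
```

For the $\pi$-system `P` the generated $\lambda$-system already equals the generated
$\sigma$-field, as Theorem 3.2 predicts. For `Q`, which omits the intersection $\{1\}$, the
generated $\lambda$-system has only 6 sets and is not a field.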


Since a field is certainly a $\pi$-system, the uniqueness asserted in Theorem
3.1 is a consequence of this result:


Theorem 3.3. Suppose that $P_1$ and $P_2$ are probability measures on $\sigma(\mathcal{P})$,
where $\mathcal{P}$ is a $\pi$-system. If $P_1$ and $P_2$ agree on $\mathcal{P}$, then they
agree on $\sigma(\mathcal{P})$.


PROOF. Let $\mathcal{L}$ be the class of sets $A$ in $\sigma(\mathcal{P})$ such that
$P_1(A) = P_2(A)$. Clearly $\Omega \in \mathcal{L}$. If $A \in \mathcal{L}$, then
$P_1(A^c) = 1 - P_1(A) = 1 - P_2(A) = P_2(A^c)$, and hence $A^c \in \mathcal{L}$. If $A_n$ are
disjoint sets in $\mathcal{L}$, then
$P_1(\bigcup_n A_n) = \sum_n P_1(A_n) = \sum_n P_2(A_n) = P_2(\bigcup_n A_n)$, and hence
$\bigcup_n A_n \in \mathcal{L}$. Therefore $\mathcal{L}$ is a $\lambda$-system. Since by
hypothesis $\mathcal{P} \subset \mathcal{L}$ and $\mathcal{P}$ is a $\pi$-system, the
$\pi$-$\lambda$ theorem gives $\sigma(\mathcal{P}) \subset \mathcal{L}$, as required. ∎

Note that the $\pi$-$\lambda$ theorem and the concept of $\lambda$-system are exactly what
are needed to make this proof work: The essential property of probability
measures is countable additivity, and this is a condition on countable disjoint
unions, the only kind involved in the requirement $(\lambda_3)$ in the definition of
$\lambda$-system. In this, as in many applications of the $\pi$-$\lambda$ theorem,
$\mathcal{L} \subset \sigma(\mathcal{P})$ and therefore $\sigma(\mathcal{P}) = \mathcal{L}$, even
though the relation $\sigma(\mathcal{P}) \subset \mathcal{L}$ itself suffices for
the conclusion of the theorem.
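The $\pi$-system hypothesis in Theorem 3.3 cannot simply be dropped. The following Python
sketch gives a standard small counterexample; the four-point space and the two measures `P1`
and `P2` are assumptions chosen here for illustration. The class $\{\{0,1\},\{1,2\}\}$
generates every subset of the space but is not closed under intersection.

```python
from fractions import Fraction

OMEGA = (0, 1, 2, 3)  # hypothetical four-point space

# Two probability measures given by their point masses.
P1 = {w: Fraction(1, 4) for w in OMEGA}
P2 = {0: Fraction(1, 2), 1: Fraction(0), 2: Fraction(1, 2), 3: Fraction(0)}


def prob(measure, event):
    """Probability of a finite event under a measure given by point masses."""
    return sum(measure[w] for w in event)


generators = [{0, 1}, {1, 2}]  # generates all subsets, but {0,1} & {1,2} = {1} is missing

agree_on_generators = all(prob(P1, g) == prob(P2, g) for g in generators)
differ_on_intersection = prob(P1, {1}) != prob(P2, {1})

print(agree_on_generators, differ_on_intersection)  # True True
```

The two measures agree on the generating class yet differ on $\{1\}$, so agreement on an
arbitrary generating class does not force agreement on the generated $\sigma$-field.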


Monotone Classes 


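A class $\mathcal{M}$ of subsets of $\Omega$ is monotone if it is closed under the formation of
monotone unions and intersections:


(i) $A_1, A_2, \ldots \in \mathcal{M}$ and $A_n \uparrow A$ imply $A \in \mathcal{M}$;
(ii) $A_1, A_2, \ldots \in \mathcal{M}$ and $A_n \downarrow A$ imply $A \in \mathcal{M}$.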


Halmos's monotone class theorem is a close relative of the $\pi$-$\lambda$ theorem but will be
less frequently used in this book.


Theorem 3.4. If $\mathcal{F}_0$ is a field and $\mathcal{M}$ is a monotone class, then
$\mathcal{F}_0 \subset \mathcal{M}$ implies $\sigma(\mathcal{F}_0) \subset \mathcal{M}$.


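PROOF. Let $m(\mathcal{F}_0)$ be the minimal monotone class over $\mathcal{F}_0$, the
intersection of all monotone classes containing $\mathcal{F}_0$. It is enough to prove
$\sigma(\mathcal{F}_0) \subset m(\mathcal{F}_0)$; this will follow if $m(\mathcal{F}_0)$ is shown
to be a field, because a monotone field is a $\sigma$-field.

Consider the class $\mathcal{G} = [A: A^c \in m(\mathcal{F}_0)]$. Since $m(\mathcal{F}_0)$ is
monotone, so is $\mathcal{G}$. Since $\mathcal{F}_0$ is a field,
$\mathcal{F}_0 \subset \mathcal{G}$, and so $m(\mathcal{F}_0) \subset \mathcal{G}$. Hence
$m(\mathcal{F}_0)$ is closed under complementation.

Define $\mathcal{G}_1$ as the class of $A$ such that $A \cup B \in m(\mathcal{F}_0)$ for all
$B \in \mathcal{F}_0$. Then $\mathcal{G}_1$ is a monotone class and
$\mathcal{F}_0 \subset \mathcal{G}_1$; from the minimality of $m(\mathcal{F}_0)$ follows
$m(\mathcal{F}_0) \subset \mathcal{G}_1$. Define $\mathcal{G}_2$ as the class of $B$ such that
$A \cup B \in m(\mathcal{F}_0)$ for all $A \in m(\mathcal{F}_0)$. Then $\mathcal{G}_2$ is a
monotone class. Now from $m(\mathcal{F}_0) \subset \mathcal{G}_1$ it follows that
$A \in m(\mathcal{F}_0)$ and $B \in \mathcal{F}_0$ together imply that
$A \cup B \in m(\mathcal{F}_0)$; in other words, $B \in \mathcal{F}_0$ implies that
$B \in \mathcal{G}_2$. Thus $\mathcal{F}_0 \subset \mathcal{G}_2$; by minimality,
$m(\mathcal{F}_0) \subset \mathcal{G}_2$, and hence $A, B \in m(\mathcal{F}_0)$ implies that
$A \cup B \in m(\mathcal{F}_0)$. ∎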


Lebesgue Measure on the Unit Interval 


Consider once again the unit interval $(0, 1]$ together with the field $\mathcal{B}_0$ of
finite disjoint unions of subintervals (Example 2.2) and the $\sigma$-field
$\mathcal{B} = \sigma(\mathcal{B}_0)$ of Borel sets in $(0, 1]$. According to Theorem 2.2, (2.12)
defines a probability measure $\lambda$ on $\mathcal{B}_0$. By Theorem 3.1, $\lambda$ extends to
$\mathcal{B}$, the extended $\lambda$ being Lebesgue measure. The probability space
$((0, 1], \mathcal{B}, \lambda)$ will be the basis for much of the probability theory in the
remaining sections of this chapter. A few geometric properties of $\lambda$ will be considered
here. Since the intervals in $(0, 1]$ form a $\pi$-system generating $\mathcal{B}$, $\lambda$ is
the only probability measure on $\mathcal{B}$ that assigns to each interval its length as its
measure.

Some Borel sets are difficult to visualize:


Example 3.1. Let $\{r_1, r_2, \ldots\}$ be an enumeration of the rationals in $(0, 1)$.
Suppose that $\epsilon$ is small, and choose an open interval $I_n = (a_n, b_n)$ such that
$r_n \in I_n \subset (0, 1)$ and $\lambda(I_n) = b_n - a_n < \epsilon 2^{-n}$. Put
$A = \bigcup_{n=1}^{\infty} I_n$. By subadditivity, $0 < \lambda(A) < \epsilon$.

Since $A$ contains all the rationals in $(0, 1)$, it is dense there. Thus $A$ is an
open, dense set with measure near 0. If $I$ is an open subinterval of $(0, 1)$,
then $I$ must intersect one of the $I_n$, and therefore $\lambda(A \cap I) > 0$.

If $B = (0, 1) - A$, then $1 - \epsilon < \lambda(B) < 1$. The set $B$ contains no interval and
is in fact nowhere dense [A15]. Despite this, $B$ has measure nearly 1. ∎
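The covering in Example 3.1 can be made concrete. The Python sketch below is only an
illustration under stated assumptions: the particular enumeration of the rationals, the
half-width rule, and the names `rationals_in_unit_interval` and `covering_intervals` are
choices made here, not taken from the text. It builds the first few $I_n$ with exact rational
arithmetic and confirms that their total length stays below $\epsilon$.

```python
from fractions import Fraction
from math import gcd


def rationals_in_unit_interval():
    """Enumerate the rationals in (0, 1) without repetition: 1/2, 1/3, 2/3, 1/4, ..."""
    q = 2
    while True:
        for p in range(1, q):
            if gcd(p, q) == 1:
                yield Fraction(p, q)
        q += 1


def covering_intervals(eps, count):
    """First `count` open intervals I_n = (a_n, b_n) with r_n in I_n, I_n inside (0, 1),
    and length b_n - a_n < eps * 2**-n."""
    intervals = []
    gen = rationals_in_unit_interval()
    for n in range(1, count + 1):
        r = next(gen)
        # Half-width small enough for the length bound and for staying inside (0, 1).
        half = min(eps / 2 ** (n + 2), r / 2, (1 - r) / 2)
        intervals.append((r - half, r + half))
    return intervals


eps = Fraction(1, 100)
intervals = covering_intervals(eps, 200)
total_length = sum(b - a for a, b in intervals)

print(total_length < eps)  # True
```

The sum of the lengths bounds $\lambda(A)$ from above by subadditivity, which is all the
example uses; the union itself is smaller because the intervals overlap.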


Example 3.2. There is a set defined in probability terms that has geometric
properties similar to those in the preceding example. As in Section 1, let
$d_n(\omega)$ be the $n$th digit in the dyadic expansion of $\omega$; see (1.7). Let
$A_n = [\omega \in (0, 1]: d_i(\omega) = d_{n+i}(\omega) = d_{2n+i}(\omega),\ i = 1, \ldots, n]$,
and let $A = \bigcup_{n=1}^{\infty} A_n$. Probabilistically, $A$ corresponds to the event that in
an infinite sequence of tosses of a coin, some finite initial segment is immediately duplicated
twice over. From $\lambda(A_n) = 2^n \cdot 2^{-3n}$ it follows that
$0 < \lambda(A) \le \sum_{n=1}^{\infty} 2^{-2n} = \frac{1}{3}$. Again $A$ is
dense in the unit interval; its measure, less than $\frac{1}{3}$, could be made less than
$\epsilon$ by requiring that some initial segment be immediately duplicated $k$ times
over with $k$ large. ∎
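The value $\lambda(A_n) = 2^n \cdot 2^{-3n} = 2^{-2n}$ can be confirmed by enumeration, since
membership in $A_n$ depends only on the first $3n$ dyadic digits and each digit pattern of
length $3n$ corresponds to a dyadic interval of length $2^{-3n}$. The following Python sketch
(the function name `measure_A_n` is an assumption made here) counts the qualifying patterns for
small $n$.

```python
from fractions import Fraction
from itertools import product


def measure_A_n(n):
    """lambda(A_n): fraction of 3n-digit prefixes whose three n-digit blocks coincide."""
    hits = sum(
        1
        for bits in product((0, 1), repeat=3 * n)
        if bits[:n] == bits[n:2 * n] == bits[2 * n:]
    )
    return Fraction(hits, 2 ** (3 * n))


for n in (1, 2, 3):
    print(n, measure_A_n(n), Fraction(1, 4 ** n))

# Partial sums of the bound sum_n 2^{-2n} stay below its limit 1/3.
partial_bound = sum(Fraction(1, 4 ** n) for n in range(1, 30))
print(partial_bound < Fraction(1, 3))  # True
```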


The outer measure (3.1) corresponding to $\lambda$ on $\mathcal{B}_0$ is the infimum of the
sums $\sum_n \lambda(A_n)$ for which $A_n \in \mathcal{B}_0$ and $A \subset \bigcup_n A_n$. Since
each $A_n$ is a finite disjoint union of intervals, this outer measure is


(3.8)    $\lambda^*(A) = \inf \sum_n |I_n|$,


where the infimum extends over coverings of $A$ by intervals $I_n$. The notion of
negligibility in Section 1 can therefore be reformulated: $A$ is negligible if and
only if $\lambda^*(A) = 0$. For $A$ in $\mathcal{B}$, this is the same thing as
$\lambda(A) = 0$. This covers the set $N$ of normal numbers: Since the complement $N^c$ is
negligible and lies in $\mathcal{B}$, $\lambda(N^c) = 0$. Therefore, the Borel set $N$ itself has
probability 1: $\lambda(N) = 1$.


Completeness 


This is the natural place to consider completeness, although it enters into probability
theory in an essential way only in connection with the study of stochastic processes in
continuous time; see Sections 37 and 38.

A probability measure space $(\Omega, \mathcal{F}, P)$ is complete if $A \subset B$,
$B \in \mathcal{F}$, and $P(B) = 0$ together imply that $A \in \mathcal{F}$ (and hence that
$P(A) = 0$). If $(\Omega, \mathcal{F}, P)$ is complete, then the conditions $A \in \mathcal{F}$,
$A \,\triangle\, A' \subset B \in \mathcal{F}$, and $P(B) = 0$ together imply that
$A' \in \mathcal{F}$ and $P(A') = P(A)$.

Suppose that $(\Omega, \mathcal{F}, P)$ is an arbitrary probability space. Define $P^*$ by (3.1)
for $\mathcal{F}_0 = \mathcal{F} = \sigma(\mathcal{F}_0)$, and consider the $\sigma$-field
$\mathcal{M}$ of $P^*$-measurable sets. The arguments leading to Theorem 3.1 show that $P^*$
restricted to $\mathcal{M}$ is a probability measure. If $P^*(B) = 0$ and $A \subset B$, then
$P^*(A \cap E) + P^*(A^c \cap E) \le P^*(B) + P^*(E) = P^*(E)$ by monotonicity, so that $A$
satisfies (3.5) and hence lies in $\mathcal{M}$. Thus $(\Omega, \mathcal{M}, P^*)$ is a
complete probability measure space. In any probability space it is therefore possible to
enlarge the $\sigma$-field and extend the measure in such a way as to get a complete space.

Suppose that $((0, 1], \mathcal{B}, \lambda)$ is completed in this way. The sets in the
completed $\sigma$-field $\mathcal{M}$ are called Lebesgue sets, and $\lambda$ extended to
$\mathcal{M}$ is still called Lebesgue measure.


Nonmeasurable Sets 


There exist in $(0, 1]$ sets that lie outside $\mathcal{B}$. For the construction (due to
Vitali) it is convenient to use addition modulo 1 in $(0, 1]$. For $x, y \in (0, 1]$ take
$x \oplus y$ to be $x + y$ or $x + y - 1$ according as $x + y$ lies in $(0, 1]$ or not.† Put
$A \oplus x = [a \oplus x: a \in A]$.

Let $\mathcal{L}$ be the class of Borel sets $A$ such that $A \oplus x$ is a Borel set and
$\lambda(A \oplus x) = \lambda(A)$. Then $\mathcal{L}$ is a $\lambda$-system containing the
intervals, and so $\mathcal{B} \subset \mathcal{L}$ by the $\pi$-$\lambda$ theorem. Thus
$A \in \mathcal{B}$ implies that $A \oplus x \in \mathcal{B}$ and
$\lambda(A \oplus x) = \lambda(A)$. In this sense, $\lambda$ is translation-invariant.
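For a single interval the invariance can be seen directly: translation modulo 1 either shifts
the interval or splits it into two pieces whose lengths add up to the original length. The
Python sketch below illustrates this under assumptions made here (half-open intervals
$(a, b]$ represented as pairs, and the helper names `oplus`, `translate_interval`, and
`total_length`); it is not the general argument, which rests on the $\pi$-$\lambda$ theorem.

```python
from fractions import Fraction


def oplus(x, y):
    """Addition modulo 1 on (0, 1]: x + y, or x + y - 1 if x + y exceeds 1."""
    s = x + y
    return s if s <= 1 else s - 1


def translate_interval(a, b, x):
    """Image of (a, b] under t -> t (+) x, for 0 <= a < b <= 1 and x in (0, 1].

    Returned as a list of disjoint half-open intervals (lo, hi].
    """
    lo, hi = a + x, b + x
    if hi <= 1:            # no wrap-around
        return [(lo, hi)]
    if lo >= 1:            # every point wraps around
        return [(lo - 1, hi - 1)]
    return [(lo, Fraction(1)), (Fraction(0), hi - 1)]  # the interval is split


def total_length(pieces):
    return sum(hi - lo for lo, hi in pieces)


a, b, x = Fraction(1, 2), Fraction(9, 10), Fraction(1, 4)
pieces = translate_interval(a, b, x)
print(pieces)
print(total_length(pieces) == b - a)  # True
```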

Define $x$ and $y$ to be equivalent ($x \sim y$) if $x \oplus r = y$ for some rational $r$ in
$(0, 1]$. Let $H$ be a subset of $(0, 1]$ consisting of exactly one representative point from
each equivalence class; such a set exists under the assumption of the axiom of choice [A8].
Consider now the countably many sets $H \oplus r$ for rational $r$.

These sets are disjoint, because no two distinct points of $H$ are equivalent. (If
$H \oplus r_1$ and $H \oplus r_2$ share the point $h_1 \oplus r_1 = h_2 \oplus r_2$, then
$h_1 \sim h_2$; this is impossible unless $h_1 = h_2$, in which case $r_1 = r_2$.) Each point of
$(0, 1]$ lies in one of these sets, because $H$ has a representative from each equivalence
class. (If $x \sim h \in H$, then $x = h \oplus r \in H \oplus r$ for some rational $r$.) Thus
$(0, 1] = \bigcup_r (H \oplus r)$, a countable disjoint union.

If $H$ were in $\mathcal{B}$, it would follow that $\lambda(0, 1] = \sum_r \lambda(H \oplus r)$.
This is impossible: If the value common to the $\lambda(H \oplus r)$ is 0, it leads to $1 = 0$;
if the common value is positive, it leads to a convergent infinite series of identical positive
terms ($a + a + \cdots < \infty$ and $a > 0$). Thus $H$ lies outside $\mathcal{B}$. ∎


Two Impossibility Theorems* 


The argument above, which uses the axiom of choice, in fact proves this: There exists
on $2^{(0,1]}$ no probability measure $P$ such that $P(A \oplus x) = P(A)$ for all
$A \in 2^{(0,1]}$ and all $x \in (0, 1]$. In particular it is impossible to extend $\lambda$ to
a translation-invariant probability measure on $2^{(0,1]}$.


†This amounts to working in the circle group, where the translation $y \to x \oplus y$ becomes a
rotation (1 is the identity). The rationals form a subgroup, and the set $H$ defined below
contains one element from each coset.

*This topic may be omitted. It uses more set theory than is assumed in the rest of the book.

There is a stronger result: There exists on $2^{(0,1]}$ no probability measure $P$ such that
$P\{x\} = 0$ for each $x$. Since $\lambda\{x\} = 0$, this implies that it is impossible to
extend $\lambda$ to $2^{(0,1]}$ at all.†

The proof of this second impossibility theorem requires the well-ordering principle
(equivalent to the axiom of choice) and also the continuum hypothesis. Let $S$ be the
set of sequences $(s(1), s(2), \ldots)$ of positive integers. Then $S$ has the power of the
continuum. (Let the $n$th partial sum of a sequence in $S$ be the position of the $n$th 1 in
the nonterminating dyadic representation of a point in $(0, 1]$; this gives a one-to-one
correspondence.) By the continuum hypothesis, the elements of $S$ can be put in a
one-to-one correspondence with the set of ordinals preceding the first uncountable
ordinal. Carrying the well ordering of these ordinals over to $S$ by means of the
correspondence gives to $S$ a well-ordering relation $<_w$ with the property that each
element has only countably many predecessors.

For $s, t$ in $S$ write $s \le t$ if $s(i) \le t(i)$ for all $i \ge 1$. Say that $t$ rejects
$s$ if $t <_w s$ and $s \le t$; this is a transitive relation. Let $T$ be the set of unrejected
elements of $S$. Let $V_s$ be the set of elements that reject $s$, and assume it is nonempty. If
$t$ is the first element (with respect to $<_w$) of $V_s$, then $t \in T$ (if $t'$ rejects $t$,
then it also rejects $s$, and since $t' <_w t$, there is a contradiction). Therefore, if $s$ is
rejected at all, it is rejected by an element of $T$.

Suppose $T$ is countable and let $t_1, t_2, \ldots$ be an enumeration of its elements. If
$t^*(k) = t_k(k) + 1$, then $t^*$ is not rejected by any $t_k$ and hence lies in $T$, which is
impossible because it is distinct from each $t_k$. Thus $T$ is uncountable and must by the
continuum hypothesis have the power of $(0, 1]$.

Let $x$ be a one-to-one map of $T$ onto $(0, 1]$; write the image of $t$ as $x_t$. Let
$A_k^i = [x_t: t(i) = k]$ be the image under $x$ of the set of $t$ in $T$ for which $t(i) = k$.
Since $t(i)$ must have some value $k$, $\bigcup_{k=1}^{\infty} A_k^i = (0, 1]$. Assume that $P$
is countably additive and choose $u$ in $S$ in such a way that
$P(\bigcup_{k \le u(i)} A_k^i) > 1 - 1/2^{i+1}$ for $i \ge 1$. If


$A = \bigcap_{i=1}^{\infty} \bigcup_{k=1}^{u(i)} A_k^i = [x_t: t(i) \le u(i),\ i \ge 1] = [x_t: t \le u]$,


then $P(A) > 0$. If $A$ is shown to be countable, this will contradict the hypothesis that
each singleton has probability 0.

Now, there is some $t_0$ in $T$ such that $u \le t_0$ (if $u \in T$, take $t_0 = u$; otherwise,
$u$ is rejected by some $t_0$ in $T$). If $t \le u$ for a $t$ in $T$, then $t \le t_0$ and hence
$t \le_w t_0$ (since otherwise $t_0$ rejects $t$). This means that $[t: t \le u]$ is contained in
the countable set $[t: t \le_w t_0]$, and $A$ is indeed countable.

PROBLEMS 


3.1. (a) In the proof of Theorem 3.1 the assumed finite additivity of $P$ is used twice
and the assumed countable additivity of $P$ is used once. Where?

(b) Show by example that a finitely additive probability measure on a field may
not be countably subadditive. Show in fact that if a finitely additive probability
measure is countably subadditive, then it is necessarily countably additive as
well.

(c) Suppose Theorem 2.1 were weakened by strengthening its hypothesis to the
assumption that $\mathcal{F}$ is a $\sigma$-field. Why would this weakened result not suffice
for the proof of Theorem 3.1?


†This refers to a countably additive extension, of course. If one is content with finite
additivity, there is an extension to $2^{(0,1]}$; see Problem 3.8.


3.2. Let $P$ be a probability measure on a field $\mathcal{F}_0$, and for every subset $A$ of
$\Omega$ define $P^*(A)$ by (3.1). Denote also by $P$ the extension (Theorem 3.1) of $P$ to
$\mathcal{F} = \sigma(\mathcal{F}_0)$.

(a) Show that

(3.9)    $P^*(A) = \inf[P(B): A \subset B,\ B \in \mathcal{F}]$

and (see (3.2))

(3.10)    $P_*(A) = \sup[P(C): C \subset A,\ C \in \mathcal{F}]$,

and show that the infimum and supremum are always achieved.

(b) Show that $A$ is $P^*$-measurable if and only if $P_*(A) = P^*(A)$.

(c) The outer and inner measures associated with a probability measure $P$ on a
$\sigma$-field $\mathcal{F}$ are usually defined by (3.9) and (3.10). Show that (3.9) and (3.10)
are the same as (3.1) and (3.2) with $\mathcal{F}$ in the role of $\mathcal{F}_0$.


3.3. 2.13 2.15 3.2↑ For the following examples, describe $P^*$ as defined by (3.1)
and $\mathcal{M} = \mathcal{M}(P^*)$ as defined by the requirement (3.4). Sort out the cases in
which $P^*$ fails to agree with $P$ on $\mathcal{F}_0$, and explain why.

(a) Let $\mathcal{F}_0$ consist of the sets $\emptyset$, $\{1\}$, $\{2, 3\}$, and
$\Omega = \{1, 2, 3\}$, and define probability measures $P_1$ and $P_2$ on $\mathcal{F}_0$ by
$P_1\{1\} = 0$ and $P_2\{2, 3\} = 0$. Note that $\mathcal{M}(P_1^*)$ and $\mathcal{M}(P_2^*)$
differ.

(b) Suppose that $\Omega$ is countably infinite, let $\mathcal{F}_0$ be the field of finite and
cofinite sets, and take $P(A)$ to be 0 or 1 as $A$ is finite or cofinite.

(c) The same, but suppose that $\Omega$ is uncountable.

(d) Suppose that $\Omega$ is uncountable, let $\mathcal{F}_0$ consist of the countable and the
cocountable sets, and take $P(A)$ to be 0 or 1 as $A$ is countable or cocountable.

(e) The probability in Problem 2.15.

(f) Let $P(A) = I_A(\omega_0)$ for $A \in \mathcal{F}_0$, and assume
$\{\omega_0\} \in \sigma(\mathcal{F}_0)$.


3.4. Let $f$ be a strictly increasing, strictly concave function on $[0, \infty)$ satisfying
$f(0) = 0$. For $A \subset (0, 1]$, define $P^*(A) = f(\lambda^*(A))$. Show that $P^*$ is an
outer measure in the sense that it satisfies $P^*(\emptyset) = 0$ and is nonnegative, monotone,
and countably subadditive. Show that $A$ lies in $\mathcal{M}$ (defined by the requirement
(3.4)) if and only if $\lambda^*(A)$ or $\lambda^*(A^c)$ is 0. Show that $P^*$ does not arise
from the definition (3.1) for any probability measure $P$ on any field $\mathcal{F}_0$.


3.5. Let $\Omega$ be the unit square $[(x, y): 0 < x, y \le 1]$, let $\mathcal{F}$ be the class
of sets of the form $[(x, y): x \in A,\ 0 < y \le 1]$, where $A \in \mathcal{B}$, and let $P$
have value $\lambda(A)$ at this set. Show that $(\Omega, \mathcal{F}, P)$ is a probability
measure space. Show for $A = [(x, y): 0 < x \le 1,\ y = \frac{1}{2}]$ that $P_*(A) = 0$ and
$P^*(A) = 1$.

3.6. Let $P$ be a finitely additive probability measure on a field $\mathcal{F}_0$. For
$A \subset \Omega$, in analogy with (3.1) define


(3.11)    $P^{\circ}(A) = \inf \sum_n P(A_n)$,


where now the infimum extends over all finite sequences of $\mathcal{F}_0$-sets $A_n$
satisfying $A \subset \bigcup_n A_n$. (If countable coverings are allowed, everything is
different. It can happen that $P^{\circ}(\Omega) = 0$; see Problem 3.3(e).) Let
$\mathcal{M}^{\circ}$ be the class of sets $A$ such that
$P^{\circ}(E) = P^{\circ}(A \cap E) + P^{\circ}(A^c \cap E)$ for all $E \subset \Omega$.

(a) Show that $P^{\circ}(\emptyset) = 0$ and that $P^{\circ}$ is nonnegative, monotone, and
finitely subadditive. Using these four properties of $P^{\circ}$, prove: Lemma 1°:
$\mathcal{M}^{\circ}$ is a field. Lemma 2°: If $A_1, A_2, \ldots$ is a finite sequence of
disjoint $\mathcal{M}^{\circ}$-sets, then for each $E \subset \Omega$,


(3.12)    $P^{\circ}\bigl(E \cap \bigl(\bigcup_k A_k\bigr)\bigr) = \sum_k P^{\circ}(E \cap A_k)$.


Lemma 3°: $P^{\circ}$ restricted to the field $\mathcal{M}^{\circ}$ is finitely additive.

(b) Show that if $P^{\circ}$ is defined by (3.11) (finite coverings), then: Lemma 4°:
$\mathcal{F}_0 \subset \mathcal{M}^{\circ}$. Lemma 5°: $P^{\circ}(A) = P(A)$ for
$A \in \mathcal{F}_0$.

(c) Define $P_{\circ}(A) = 1 - P^{\circ}(A^c)$. Prove that if $E \subset A \in \mathcal{F}_0$,
then


(3.13)    $P_{\circ}(E) = P(A) - P^{\circ}(A - E)$.


3.7. 2.7 3.6↑ Suppose that $H$ lies outside the field $\mathcal{F}_0$, and let $\mathcal{F}_1$
be the field generated by $\mathcal{F}_0 \cup \{H\}$, so that $\mathcal{F}_1$ consists of the
sets $(H \cap A) \cup (H^c \cap B)$ with $A, B \in \mathcal{F}_0$. The problem is to show that a
finitely additive probability measure $P$ on $\mathcal{F}_0$ has a finitely additive extension
to $\mathcal{F}_1$. Define $Q$ on $\mathcal{F}_1$ by


(3.14)    $Q((H \cap A) \cup (H^c \cap B)) = P^{\circ}(H \cap A) + P_{\circ}(H^c \cap B)$


for $A, B \in \mathcal{F}_0$.

(a) Show that the definition is consistent.

(b) Show that $Q$ agrees with $P$ on $\mathcal{F}_0$.

(c) Show that $Q$ is finitely additive on $\mathcal{F}_1$. Show that $Q(H) = P^{\circ}(H)$.

(d) Define $Q'$ by interchanging the roles of $P^{\circ}$ and $P_{\circ}$ on the right in (3.14).
Show that $Q'$ is another finitely additive extension of $P$ to $\mathcal{F}_1$. The same is true
of any convex combination $Q''$ of $Q$ and $Q'$. Show that $Q''(H)$ can take any
value between $P_{\circ}(H)$ and $P^{\circ}(H)$.


3.8. ↑ Use Zorn's lemma to prove a theorem of Tarski: A finitely additive
probability measure on a field has a finitely additive extension to the field of all
subsets of the space.


3.9. ↑ (a) Let $P$ be a (countably additive) probability measure on a $\sigma$-field
$\mathcal{F}$. Suppose that $H \notin \mathcal{F}$, and let
$\mathcal{F}_1 = \sigma(\mathcal{F} \cup \{H\})$. By adapting the ideas in Problem 3.7, show
that $P$ has a countably additive extension from $\mathcal{F}$ to $\mathcal{F}_1$.

(b) It is tempting to go on and use Zorn's lemma to extend $P$ to a completely
additive probability measure on the $\sigma$-field of all subsets of $\Omega$. Where does the
obvious proof break down?


3.10. As shown in the text, a probability measure space $(\Omega, \mathcal{F}, P)$ has a
complete extension; that is, there exists a complete probability measure space
$(\Omega, \mathcal{F}_1, P_1)$ such that $\mathcal{F} \subset \mathcal{F}_1$ and $P_1$ agrees
with $P$ on $\mathcal{F}$.

(a) Suppose that $(\Omega, \mathcal{F}_2, P_2)$ is a second complete extension. Show by an
example in a space of two points that $P_1$ and $P_2$ need not agree on the $\sigma$-field
$\mathcal{F}_1 \cap \mathcal{F}_2$.

(b) There is, however, a unique minimal complete extension: Let $\mathcal{F}^+$ consist
of the sets $A$ for which there exist $\mathcal{F}$-sets $B$ and $C$ such that
$A \,\triangle\, B \subset C$ and $P(C) = 0$. Show that $\mathcal{F}^+$ is a $\sigma$-field. For
such a set $A$ define $P^+(A) = P(B)$. Show that the definition is consistent, that $P^+$ is a
probability measure on $\mathcal{F}^+$, and that $(\Omega, \mathcal{F}^+, P^+)$ is complete.
Show that, if $(\Omega, \mathcal{F}_1, P_1)$ is any complete extension of
$(\Omega, \mathcal{F}, P)$, then $\mathcal{F}^+ \subset \mathcal{F}_1$ and $P_1$ agrees with
$P^+$ on $\mathcal{F}^+$; $(\Omega, \mathcal{F}^+, P^+)$ is the completion of
$(\Omega, \mathcal{F}, P)$.

(c) Show that $A \in \mathcal{F}^+$ if and only if $P_*(A) = P^*(A)$, where $P_*$ and $P^*$ are
defined by (3.9) and (3.10), and that $P^+(A) = P_*(A) = P^*(A)$ in this case. Thus
the complete extension constructed in the text is exactly the completion.


3.11. (a) Show that a $\lambda$-system satisfies the conditions

($\lambda_4$) $A, B \in \mathcal{L}$ and $A \cap B = \emptyset$ imply $A \cup B \in \mathcal{L}$,

($\lambda_5$) $A_1, A_2, \ldots \in \mathcal{L}$ and $A_n \uparrow A$ imply $A \in \mathcal{L}$,

($\lambda_6$) $A_1, A_2, \ldots \in \mathcal{L}$ and $A_n \downarrow A$ imply $A \in \mathcal{L}$.

(b) Show that $\mathcal{L}$ is a $\lambda$-system if and only if it satisfies $(\lambda_1)$,
$(\lambda_2')$, and $(\lambda_5)$. (Sometimes these conditions, with a redundant $(\lambda_4)$,
are taken as the definition.)


3.12. 2.5 3.11↑ (a) Show that if $\mathcal{P}$ is a $\pi$-system, then the minimal
$\lambda$-system over $\mathcal{P}$ coincides with $\sigma(\mathcal{P})$.

(b) Let $\mathcal{P}$ be a $\pi$-system and $\mathcal{M}$ a monotone class. Show that
$\mathcal{P} \subset \mathcal{M}$ does not imply $\sigma(\mathcal{P}) \subset \mathcal{M}$.

(c) Deduce the $\pi$-$\lambda$ theorem from the monotone class theorem by showing
directly that, if a $\lambda$-system $\mathcal{L}$ contains a $\pi$-system $\mathcal{P}$, then
$\mathcal{L}$ also contains the field generated by $\mathcal{P}$.


3.13. 2.5↑ (a) Suppose that $\mathcal{F}_0$ is a field and $P_1$ and $P_2$ are probability
measures on $\sigma(\mathcal{F}_0)$. Show by the monotone class theorem that if $P_1$ and $P_2$
agree on $\mathcal{F}_0$, then they agree on $\sigma(\mathcal{F}_0)$.

(b) Let $\mathcal{F}_0$ be the smallest field over the $\pi$-system $\mathcal{P}$. Show by the
inclusion-exclusion formula that probability measures agreeing on $\mathcal{P}$ must agree also
on $\mathcal{F}_0$. Now deduce Theorem 3.3 from part (a).


3.14. 1.5 2.22↑ Prove the existence of a Lebesgue set of Lebesgue measure 0 that
is not a Borel set.


3.15. 1.3 3.6 3.14↑ The outer content of a set $A$ in $(0, 1]$ is
$c^*(A) = \inf \sum_n |I_n|$, where the infimum extends over finite coverings of $A$ by
intervals $I_n$. Thus $A$ is trifling in the sense of Problem 1.3 if and only if $c^*(A) = 0$.
Define inner content by $c_*(A) = 1 - c^*(A^c)$. Show that $c_*(A) = \sup \sum_n |I_n|$, where
the supremum extends over finite disjoint unions of intervals $I_n$ contained in $A$ (of course
the analogue for $\lambda_*$ fails). Show that $c_*(A) \le c^*(A)$; if the two are equal, their
common value is taken as the content $c(A)$ of $A$, which is then Jordan
measurable. Connect all this with Problem 3.6.

Show that $c^*(A) = c^*(A^-)$, where $A^-$ is the closure of $A$ (the analogue for
$\lambda^*$ fails).

A trifling set is Jordan measurable. Find (Problem 3.14) a Jordan measurable
set that is not a Borel set.

Show that $c_*(A) \le \lambda_*(A) \le \lambda^*(A) \le c^*(A)$. What happens in this string of
inequalities if $A$ consists of the rationals in $(0, \frac{1}{2}]$ together with the
irrationals in $(\frac{1}{2}, 1]$?


3.16. 1.5↑ Deduce directly by countable additivity that the Cantor set has Lebesgue
measure 0.


3.17. From the fact that $\lambda(x \oplus A) = \lambda(A)$, deduce that sums and differences of
normal numbers may be nonnormal.


3.18. Let $H$ be the nonmeasurable set constructed at the end of the section.

(a) Show that, if $A$ is a Borel set and $A \subset H$, then $\lambda(A) = 0$; that is,
$\lambda_*(H) = 0$.

(b) Show that, if $\lambda^*(E) > 0$, then $E$ contains a nonmeasurable subset.


3.19. The aim of this problem is the construction of a Borel set $A$ in $(0, 1)$ such that
$0 < \lambda(A \cap G) < \lambda(G)$ for every nonempty open set $G$ in $(0, 1)$.

(a) It is shown in Example 3.1 how to construct a Borel set of positive Lebesgue
measure that is nowhere dense. Show that every interval contains such a set.

(b) Let $\{I_n\}$ be an enumeration of the open intervals in $(0, 1)$ with rational
endpoints. Construct disjoint, nowhere dense Borel sets $A_1, B_1, A_2, B_2, \ldots$ of
positive Lebesgue measure such that $A_n \cup B_n \subset I_n$.

(c) Let $A = \bigcup_k A_k$. A nonempty open $G$ in $(0, 1)$ contains some $I_n$. Show that
$0 < \lambda(A_n) \le \lambda(A \cap G) < \lambda(A \cap G) + \lambda(B_n) \le \lambda(G)$.


3.20. ↑ There is no Borel set $A$ in $(0, 1)$ such that
$a\lambda(I) \le \lambda(A \cap I) \le b\lambda(I)$ for every open interval $I$ in $(0, 1)$,
where $0 < a \le b < 1$. In fact prove:

(a) If $\lambda(A \cap I) \le b\lambda(I)$ for all $I$ and if $b < 1$, then $\lambda(A) = 0$.
Hint: Choose an open $G$ such that $A \subset G \subset (0, 1)$ and
$\lambda(G) < b^{-1}\lambda(A)$; represent $G$ as a disjoint union of intervals and obtain a
contradiction.

(b) If $a\lambda(I) \le \lambda(A \cap I)$ for all $I$ and if $a > 0$, then $\lambda(A) = 1$.


3.21. Show that not every subset of the unit interval is a Lebesgue set. Hint: Show
that $\lambda^*$ is translation-invariant on $2^{(0,1]}$; then use the first impossibility
theorem (p. 45). Or use the second impossibility theorem.


Complex probability ideas can be made clear by the systematic use of
measure theory, and probabilistic ideas of extramathematical origin, such as
independence, can illuminate problems of purely mathematical interest. It is
to this reciprocal exchange that measure-theoretic probability owes much of
its interest.

The results of this section concern infinite sequences of events in a
probability space.† They will be illustrated by examples in the unit interval.
By this will always be meant the triple $(\Omega, \mathcal{F}, P)$ for which $\Omega$ is
$(0, 1]$, $\mathcal{F}$ is the $\sigma$-field $\mathcal{B}$ of Borel sets there, and $P(A)$ is
for $A$ in $\mathcal{F}$ the Lebesgue measure $\lambda(A)$ of $A$. This is the space appropriate
to the problems of Section 1, which will be pursued further. The definitions and theorems, as
opposed to the examples, apply to all probability spaces. The unit interval will appear
again and again in this chapter, and it is essential to keep in mind that there
are many other important spaces to which the general theory will be applied
later.


General Formulas 


The formulas (2.5) through (2.11) will be used repeatedly. The sets involved
in such formulas lie in the basic $\sigma$-field $\mathcal{F}$ by hypothesis. Any probability
argument starts from given sets assumed (often tacitly) to lie in $\mathcal{F}$; further
sets constructed in the course of the argument must be shown to lie in $\mathcal{F}$ as
well, but it is usually quite clear how to do this.

If $P(A) > 0$, the conditional probability of $B$ given $A$ is defined in the
usual way as


(4.1)    $P(B|A) = \dfrac{P(A \cap B)}{P(A)}$.


There are the chain-rule formulas


(4.2)    $P(A \cap B) = P(A)P(B|A)$,
         $P(A \cap B \cap C) = P(A)P(B|A)P(C|A \cap B)$,
         $\ldots$


If $A_1, A_2, \ldots$ partition $\Omega$, then


(4.3)    $P(B) = \sum_n P(A_n)P(B|A_n)$.


†They come under what Borel in his first paper on the subject (see the footnote on p. 9) called
probabilités dénombrables; hence the section heading.

Note that for fixed $A$ the function $P(B|A)$ defines a probability measure as
$B$ varies over $\mathcal{F}$.
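Formulas (4.1) through (4.3) can be checked on a small discrete space. The Python sketch
below assumes a fair die and events chosen here purely for illustration (the names `P`,
`cond`, `A`, `B`, `C`, and `partition` are not from the text).

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # a fair die, assumed for illustration


def P(event):
    """Uniform probability on OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def cond(b, a):
    """P(B | A) as in (4.1); requires P(A) > 0."""
    return P(a & b) / P(a)


A = frozenset({2, 4, 6})
B = frozenset({4, 5, 6})
C = frozenset({3, 6})

# Chain rule (4.2) for three events.
lhs = P(A & B & C)
rhs = P(A) * cond(B, A) * cond(C, A & B)
print(lhs, rhs)

# Total probability (4.3) over a partition of OMEGA.
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
total = sum(P(a) * cond(B, a) for a in partition)
print(P(B), total)
```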

If $P(A_n) \equiv 0$, then by subadditivity $P(\bigcup_n A_n) = 0$. If $P(A_n) \equiv 1$, then
$\bigcap_n A_n$ has complement $\bigcup_n A_n^c$ of probability 0. This gives two facts used over
and over again:

If $A_1, A_2, \ldots$ are sets of probability 0, so is $\bigcup_n A_n$. If $A_1, A_2, \ldots$
are sets of probability 1, so is $\bigcap_n A_n$.


Limit Sets 

For a sequence $A_1, A_2, \ldots$ of sets, define a set

(4.4)    $\limsup_n A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$

and a set

(4.5)    $\liminf_n A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k$.


These sets† are the limits superior and inferior of the sequence $\{A_n\}$. They lie
in $\mathcal{F}$ if all the $A_n$ do. Now $\omega$ lies in (4.4) if and only if for each $n$
there is some $k \ge n$ for which $\omega \in A_k$; in other words, $\omega$ lies in (4.4) if and
only if it lies in infinitely many of the $A_n$. In the same way, $\omega$ lies in (4.5) if and
only if there is some $n$ such that $\omega \in A_k$ for all $k \ge n$; in other words, $\omega$
lies in (4.5) if and only if it lies in all but finitely many of the $A_n$.

Note that $\bigcap_{k=n}^{\infty} A_k \uparrow \liminf_n A_n$ and
$\bigcup_{k=n}^{\infty} A_k \downarrow \limsup_n A_n$. For every $m$ and $n$,
$\bigcap_{k=m}^{\infty} A_k \subset \bigcup_{k=n}^{\infty} A_k$, because for
$i \ge \max\{m, n\}$, $A_i$ contains the first of these sets and is contained in the second.
Taking the union over $m$ and the intersection over $n$ shows that (4.5) is a subset of (4.4).
But this follows more easily from the observation that if $\omega$ lies in all but finitely many
of the $A_n$, then of course it lies in infinitely many of them. Facts about limits inferior
and superior can usually be deduced from the logic they involve more easily
than by formal set-theoretic manipulations.
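For a periodic sequence of sets the definitions (4.4) and (4.5) can be evaluated exactly,
because every tail union or tail intersection is already determined by one full period. The
Python sketch below relies on that assumption; the sequence `A` and the helper names are
hypothetical choices made here.

```python
def tail_union(seq, n, period):
    """Union of seq(k) over k >= n, for a sequence with the given period."""
    return frozenset().union(*(seq(k) for k in range(n, n + period)))


def tail_intersection(seq, n, period):
    """Intersection of seq(k) over k >= n, for a sequence with the given period."""
    return frozenset.intersection(*(seq(k) for k in range(n, n + period)))


def lim_sup(seq, period):
    """(4.4): intersection over n of the tail unions."""
    return frozenset.intersection(
        *(tail_union(seq, n, period) for n in range(1, period + 1))
    )


def lim_inf(seq, period):
    """(4.5): union over n of the tail intersections."""
    return frozenset().union(
        *(tail_intersection(seq, n, period) for n in range(1, period + 1))
    )


def A(n):
    """A_n = {1, 2} for odd n and {2, 3} for even n (period 2)."""
    return frozenset({1, 2}) if n % 2 else frozenset({2, 3})


print(sorted(lim_sup(A, 2)))  # [1, 2, 3]: points lying in infinitely many A_n
print(sorted(lim_inf(A, 2)))  # [2]: points lying in all but finitely many A_n
```

Here the limit inferior is a proper subset of the limit superior, so this sequence has no
limit in the sense defined next.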

If (4.4) and (4.5) are equal, write


(4.6)    $\lim_n A_n = \liminf_n A_n = \limsup_n A_n$.


To say that $A_n$ has limit $A$, written $A_n \to A$, means that the limits inferior
and superior do coincide and in fact coincide with $A$. Since
$\liminf_n A_n \subset \limsup_n A_n$ always holds, to check whether $A_n \to A$ is to check
whether $\limsup_n A_n \subset A \subset \liminf_n A_n$. From $A_n \in \mathcal{F}$ and
$A_n \to A$ follows $A \in \mathcal{F}$.


†See Problems 4.1 and 4.2 for the analogy between set-theoretic and numerical limits superior
and inferior.

Example 4.1. Consider the functions $d_n(\omega)$ defined on the unit interval by
the dyadic expansion (1.7), and let $l_n(\omega)$ be the length of the run of 0's
starting at $d_n(\omega)$: $l_n(\omega) = k$ if
$d_n(\omega) = \cdots = d_{n+k-1}(\omega) = 0$ and $d_{n+k}(\omega) = 1$;
here $l_n(\omega) = 0$ if $d_n(\omega) = 1$. Probabilities can be computed by (1.10). Since
$[\omega: l_n(\omega) = k]$ is a union of $2^{n-1}$ disjoint intervals of length $2^{-n-k}$, it
lies in $\mathcal{F}$ and has probability $2^{-k-1}$. Therefore,
$[\omega: l_n(\omega) \ge r] = [\omega: d_i(\omega) = 0,\ n \le i < n + r]$ lies also in
$\mathcal{F}$ and has probability $\sum_{k \ge r} 2^{-k-1}$:


(4.7)    $P[\omega: l_n(\omega) \ge r] = 2^{-r}$.


If $A_n$ is the event in (4.7), then (4.4) is the set of $\omega$ such that
$l_n(\omega) \ge r$ for infinitely many $n$, or, $n$ being regarded as a time index, such that
$l_n(\omega) \ge r$ infinitely often. ∎
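The probabilities in Example 4.1 depend on only finitely many dyadic digits, so they can be
confirmed by enumerating digit patterns, each pattern of length $m$ having probability
$2^{-m}$. The Python sketch below does this for small $n$, $k$, and $r$; the function names
are assumptions made here.

```python
from fractions import Fraction
from itertools import product


def run_length(bits, n):
    """Number of consecutive zeros in `bits` starting at position n (1-indexed)."""
    k = 0
    while n - 1 + k < len(bits) and bits[n - 1 + k] == 0:
        k += 1
    return k


def prob_run_equals(n, k):
    """P[l_n = k]: decided by the digits d_1, ..., d_{n+k}."""
    m = n + k
    hits = sum(1 for bits in product((0, 1), repeat=m) if run_length(bits, n) == k)
    return Fraction(hits, 2 ** m)


def prob_run_at_least(n, r):
    """P[l_n >= r]: decided by the digits d_1, ..., d_{n+r-1}."""
    m = n + r - 1
    hits = sum(1 for bits in product((0, 1), repeat=m) if run_length(bits, n) >= r)
    return Fraction(hits, 2 ** m)


for n in (1, 2, 3):
    for r in (1, 2, 3):
        print(n, r, prob_run_equals(n, r), prob_run_at_least(n, r))
```

The enumeration reproduces $P[l_n = k] = 2^{-k-1}$ and (4.7), independently of $n$.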


Because of the theory of Sections 2 and 3, statements like (4.7) are valid in
the sense of ordinary mathematics, and using the traditional language of
probability ("heads," "runs," and so on) does not change this.

When $n$ has the role of time, (4.4) is frequently written


(4.8)    $\limsup_n A_n = [A_n \text{ i.o.}]$,

where "i.o." stands for "infinitely often."

Theorem 4.1. (i) For each sequence $\{A_n\}$,

(4.9)    $P(\liminf_n A_n) \le \liminf_n P(A_n) \le \limsup_n P(A_n) \le P(\limsup_n A_n)$.

(ii) If $A_n \to A$, then $P(A_n) \to P(A)$.


PROOF. Clearly (ii) follows from (i). As for (i), if $B_n = \bigcap_{k=n}^{\infty} A_k$ and
$C_n = \bigcup_{k=n}^{\infty} A_k$, then $B_n \uparrow \liminf_n A_n$ and
$C_n \downarrow \limsup_n A_n$, so that, by parts (i) and (ii) of Theorem 2.1,
$P(A_n) \ge P(B_n) \to P(\liminf_n A_n)$ and $P(A_n) \le P(C_n) \to P(\limsup_n A_n)$. ∎


Example 4.2. Define $l_n(\omega)$ as in Example 4.1, and let
$A_n = [\omega: l_n(\omega) \ge r]$ for fixed $r$. By (4.7) and (4.9),
$P[\omega: l_n(\omega) \ge r \text{ i.o.}] \ge 2^{-r}$. Much stronger
results will be proved later. ∎


Independent Events 


Events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. (Sometimes an
unnecessary mutually is put in front of independent.) For events of positive
probability, this is the same thing as requiring $P(B|A) = P(B)$ or $P(A|B) = P(A)$.
More generally, a finite collection $A_1, \ldots, A_n$ of events is independent
if

(4.10)    $P(A_{k_1} \cap \cdots \cap A_{k_j}) = P(A_{k_1}) \cdots P(A_{k_j})$

for $2 \le j \le n$ and $1 \le k_1 < \cdots < k_j \le n$. Reordering the sets clearly has no
effect on the condition for independence, and a subcollection of independent
events is also independent. An infinite (perhaps uncountable) collection of
events is defined to be independent if each of its finite subcollections is.

If $n = 3$, (4.10) imposes for $j = 2$ the three constraints


(4.11)    $P(A_1 \cap A_2) = P(A_1)P(A_2)$,    $P(A_1 \cap A_3) = P(A_1)P(A_3)$,
          $P(A_2 \cap A_3) = P(A_2)P(A_3)$,


and for $j = 3$ the single constraint

(4.12)    $P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3)$.


Example 4.3. Consider in the unit interval the events
$B_{uv} = [\omega: d_u(\omega) = d_v(\omega)]$ (the $u$th and $v$th tosses agree) and let
$A_1 = B_{12}$, $A_2 = B_{13}$, $A_3 = B_{23}$. Then $A_1, A_2, A_3$ are pairwise independent in
the sense that (4.11) holds (the two sides of each equation being $\frac{1}{4}$). But since
$A_1 \cap A_2 \subset A_3$, (4.12) does not hold (the left side is $\frac{1}{4}$ and the right
is $\frac{1}{8}$), and the events are not independent. ∎


Example 4.4. In the discrete space $\Omega = \{1, \ldots, 6\}$ suppose each point has
probability $\frac{1}{6}$ (a fair die is rolled). If $A_1 = \{1, 2, 3, 4\}$ and
$A_2 = A_3 = \{4, 5, 6\}$, then (4.12) holds but none of the equations in (4.11) do. Again the
events are not independent. ∎
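Both examples can be verified mechanically by testing every constraint of the form (4.10).
The Python sketch below does so on finite uniform spaces; three tosses suffice for
Example 4.3 because the events involve only $d_1$, $d_2$, $d_3$. The helper names `make_P`
and `failed_constraints` are assumptions introduced here.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def make_P(omega):
    """Uniform probability on a finite sample space."""
    omega = list(omega)
    return lambda event: Fraction(sum(1 for w in omega if w in event), len(omega))


def failed_constraints(P, events):
    """Index tuples (k_1 < ... < k_j), j >= 2, for which (4.10) fails."""
    bad = []
    for j in range(2, len(events) + 1):
        for idx in combinations(range(len(events)), j):
            inter = set.intersection(*(events[i] for i in idx))
            if P(inter) != prod(P(events[i]) for i in idx):
                bad.append(tuple(i + 1 for i in idx))
    return bad


# Example 4.3: B_uv = [d_u = d_v] on three fair tosses.
tosses = list(product((0, 1), repeat=3))
P_toss = make_P(tosses)
A1 = {w for w in tosses if w[0] == w[1]}
A2 = {w for w in tosses if w[0] == w[2]}
A3 = {w for w in tosses if w[1] == w[2]}
print(failed_constraints(P_toss, [A1, A2, A3]))  # [(1, 2, 3)]

# Example 4.4: a fair die.
P_die = make_P(range(1, 7))
print(failed_constraints(P_die, [{1, 2, 3, 4}, {4, 5, 6}, {4, 5, 6}]))
# [(1, 2), (1, 3), (2, 3)]
```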


Independence requires that (4.10) hold for each $j = 2, \ldots, n$ and each
choice of $k_1, \ldots, k_j$, a total of $\sum_{j=2}^{n} \binom{n}{j} = 2^n - 1 - n$
constraints. This requirement can be stated in a different way: If $B_1, \ldots, B_n$ are sets
such that for each $i = 1, \ldots, n$ either $B_i = A_i$ or $B_i = \Omega$, then


(4.13)    $P(B_1 \cap B_2 \cap \cdots \cap B_n) = P(B_1)P(B_2) \cdots P(B_n)$.


The point is that if $B_i = \Omega$, then $B_i$ can be ignored in the intersection on the
left and the factor $P(B_i) = 1$ can be ignored in the product on the right. For
example, replacing $A_2$ by $\Omega$ reduces (4.12) to the middle equation in (4.11).
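The count $2^n - 1 - n$ is the number of index sets of size at least 2 drawn from
$\{1, \ldots, n\}$: all $2^n$ subsets, minus the empty set and the $n$ singletons. A short
Python sketch (the function name `count_constraints` is an assumption made here) confirms
this by listing the index sets.

```python
from itertools import combinations
from math import comb


def count_constraints(n):
    """Number of index sets k_1 < ... < k_j with 2 <= j <= n, one per constraint (4.10)."""
    return sum(1 for j in range(2, n + 1) for _ in combinations(range(1, n + 1), j))


for n in range(2, 8):
    by_listing = count_constraints(n)
    by_binomials = sum(comb(n, j) for j in range(2, n + 1))
    print(n, by_listing, by_binomials, 2 ** n - 1 - n)
```

For $n = 3$ this gives the four constraints displayed in (4.11) and (4.12).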
From the assumed independence of certain sets it is possible to deduce
the independence of other sets.

Example 4.5. On the unit interval the events $H_n = [\omega: d_n(\omega) = 0]$,
$n = 1, 2, \ldots$, are independent, the two sides of (4.10) having in this case value $2^{-j}$.
It seems intuitively clear that from this should follow the independence, for
example, of $[\omega: d_2(\omega) = 0] = H_2$ and
$[\omega: d_1(\omega) = 0,\ d_3(\omega) = 1] = H_1 \cap H_3^c$,
since the two events involve disjoint sets of times. Further, any sets $A$ and
$B$ depending, respectively, say, only on even and on odd times (like
$[\omega: d_{2n}(\omega) = 0 \text{ i.o.}]$ and $[\omega: d_{2n+1}(\omega) = 1 \text{ i.o.}]$)
ought also to be independent. This raises the general question of what it means for $A$ to
depend only on even times. Intuitively, it requires that knowing which ones among
$H_2, H_4, \ldots$ occurred entails knowing whether or not $A$ occurred; that is, it requires
that the sets $H_2, H_4, \ldots$ "determine" $A$. The set-theoretic form of this requirement is
that $A$ is to lie in the $\sigma$-field generated by $H_2, H_4, \ldots$. From
$A \in \sigma(H_2, H_4, \ldots)$ and $B \in \sigma(H_1, H_3, \ldots)$ it ought to be possible to
deduce the independence of $A$ and $B$. ∎
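The first claim in Example 4.5 involves only the digits $d_1$, $d_2$, $d_3$, so it can be
checked on the eight equally likely outcomes of three tosses. The Python sketch below is an
illustration under that finite model; the names `OMEGA`, `P`, and `H` are assumptions made
here.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product((0, 1), repeat=3))  # outcomes (d_1, d_2, d_3) of three fair tosses


def P(event):
    return Fraction(len(event), len(OMEGA))


# H[n] is the event that the n-th toss is 0.
H = {n: {w for w in OMEGA if w[n - 1] == 0} for n in (1, 2, 3)}

A = H[2]           # depends only on the even time 2
B = H[1] - H[3]    # H_1 intersected with the complement of H_3: odd times only

print(P(A), P(B), P(A & B), P(A & B) == P(A) * P(B))  # 1/2 1/4 1/8 True
```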


The next theorem and its corollaries make such deductions possible.
Define classes $\mathcal{A}_1, \ldots, \mathcal{A}_n$ in the basic $\sigma$-field $\mathcal{F}$
to be independent if for each choice of $A_i$ from $\mathcal{A}_i$, $i = 1, \ldots, n$, the
events $A_1, \ldots, A_n$ are independent. This is the same as requiring that (4.13) hold
whenever for each $i$, $1 \le i \le n$, either $B_i \in \mathcal{A}_i$ or $B_i = \Omega$.

Theorem 4.2. If $\mathcal{A}_1, \ldots, \mathcal{A}_n$ are independent and each
$\mathcal{A}_i$ is a $\pi$-system, then $\sigma(\mathcal{A}_1), \ldots, \sigma(\mathcal{A}_n)$
are independent.


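PROOF. Let $\mathcal{B}_i$ be the class $\mathcal{A}_i$ augmented by $\Omega$ (which may be an
element of $\mathcal{A}_i$ to start with). Then each $\mathcal{B}_i$ is a $\pi$-system, and by
the hypothesis of independence, (4.13) holds if $B_i \in \mathcal{B}_i$, $i = 1, \ldots, n$. For
fixed sets $B_2, \ldots, B_n$ lying respectively in $\mathcal{B}_2, \ldots, \mathcal{B}_n$, let
$\mathcal{L}$ be the class of $\mathcal{F}$-sets $B_1$ for which (4.13) holds. Then
$\mathcal{L}$ is a $\lambda$-system containing the $\pi$-system $\mathcal{B}_1$ and hence
(Theorem 3.2) containing $\sigma(\mathcal{B}_1) = \sigma(\mathcal{A}_1)$. Therefore, (4.13) holds
if $B_1, B_2, \ldots, B_n$ lie respectively in
$\sigma(\mathcal{A}_1), \mathcal{B}_2, \ldots, \mathcal{B}_n$, which means that
$\sigma(\mathcal{A}_1), \mathcal{A}_2, \ldots, \mathcal{A}_n$ are independent. Clearly the
argument goes through if 1 is replaced by any of the indices $2, \ldots, n$.

From the independence of $\sigma(\mathcal{A}_1), \mathcal{A}_2, \ldots, \mathcal{A}_n$ now
follows that of
$\sigma(\mathcal{A}_1), \sigma(\mathcal{A}_2), \mathcal{A}_3, \ldots, \mathcal{A}_n$, and so
on. ∎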


If $\mathcal{A} = \{A_1, \ldots, A_n\}$ is finite, then each $A$ in $\sigma(\mathcal{A})$ can be expressed by a
“formula” such as $A = A_2 \cap A_5^c$ or $A = (A_2 \cap A_7) \cup (A_3 \cap A_7^c \cap A_8)$. If $\mathcal{A}$ is
infinite, the sets in $\sigma(\mathcal{A})$ may be very complicated; the way to make precise
the idea that the elements of $\mathcal{A}$ “determine” $A$ is not to require formulas,
but simply to require that $A$ lie in $\sigma(\mathcal{A})$.

Independence for an infinite collection of classes is defined just as in the
finite case: $[\mathcal{A}_\theta\colon \theta \in \Theta]$ is independent if the collection $[A_\theta\colon \theta \in \Theta]$ of sets is
independent for each choice of $A_\theta$ from $\mathcal{A}_\theta$. This is equivalent to the
independence of each finite subcollection $\mathcal{A}_{\theta_1}, \ldots, \mathcal{A}_{\theta_n}$ of classes, because of
the way independence for infinite classes of sets is defined in terms of
independence for finite classes. Hence Theorem 4.2 has an immediate
consequence:


Corollary 1. If $\mathcal{A}_\theta$, $\theta \in \Theta$, are independent and each $\mathcal{A}_\theta$ is a $\pi$-system,
then $\sigma(\mathcal{A}_\theta)$, $\theta \in \Theta$, are independent.


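Corollary 2. Suppose that the array

(4.14)    $$\begin{array}{ccc} A_{11} & A_{12} & \cdots \\ A_{21} & A_{22} & \cdots \\ \vdots & \vdots & \end{array}$$

of events is independent; here each row is a finite or infinite sequence, and there
are finitely or infinitely many rows. If $\mathcal{F}_i$ is the $\sigma$-field generated by the $i$th row,
then $\mathcal{F}_1, \mathcal{F}_2, \ldots$ are independent.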


Proof. If $\mathcal{A}_i$ is the class of all finite intersections of elements of the $i$th
row of (4.14), then $\mathcal{A}_i$ is a $\pi$-system and $\sigma(\mathcal{A}_i) = \mathcal{F}_i$. Let $I$ be a finite
collection of indices (integers), and for each $i$ in $I$ let $J_i$ be a finite collection
of indices. Consider for $i \in I$ the element $C_i = \bigcap_{j \in J_i} A_{ij}$ of $\mathcal{A}_i$. Since every
finite subcollection of the array (4.14) is independent (the intersections and
products here extend over $i \in I$ and $j \in J_i$),

    $$P\Bigl(\bigcap_i C_i\Bigr) = P\Bigl(\bigcap_i \bigcap_j A_{ij}\Bigr) = \prod_i \prod_j P(A_{ij}) = \prod_i P\Bigl(\bigcap_j A_{ij}\Bigr) = \prod_i P(C_i).$$

It follows that the classes $\mathcal{A}_1, \mathcal{A}_2, \ldots$ are independent, so that Corollary 1
applies. ■


Corollary 2 implies the independence of the events discussed in Example
4.5. The array (4.14) in this case has two rows:

    $$\begin{array}{cccc} H_1 & H_3 & H_5 & \cdots \\ H_2 & H_4 & H_6 & \cdots \end{array}$$


Theorem 4.2 also implies, for example, that for independent $A_1, \ldots, A_n$,

(4.15)    $$P(A_1^c \cap \cdots \cap A_k^c \cap A_{k+1} \cap \cdots \cap A_n) = P(A_1^c) \cdots P(A_k^c)\, P(A_{k+1}) \cdots P(A_n).$$

To prove this, let $\mathcal{A}_i$ consist of $A_i$ alone; of course, $A_i^c \in \sigma(\mathcal{A}_i)$. In (4.15)
any subcollection of the $A_i$ could be replaced by their complements.


Example 4.6. Consider as in Example 4.3 the events $B_{uv}$ that, in a
sequence of tosses of a fair coin, the $u$th and $v$th outcomes agree. Let $\mathcal{A}_1$
consist of the events $B_{12}$ and $B_{13}$, and let $\mathcal{A}_2$ consist of the event $B_{23}$. Since
these three events are pairwise independent, the classes $\mathcal{A}_1$ and $\mathcal{A}_2$ are
independent. Since $B_{23} = (B_{12} \,\triangle\, B_{13})^c$ lies in $\sigma(\mathcal{A}_1)$, however, $\sigma(\mathcal{A}_1)$ and
$\sigma(\mathcal{A}_2)$ are not independent. The trouble is that $\mathcal{A}_1$ is not a $\pi$-system, which
shows that this condition in Theorem 4.2 is essential. ■
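
The failure in Example 4.6 can be checked by brute force on the eight outcomes of three tosses. The following is a minimal sketch; the helper names (`prob`, `agree`, `independent`) are illustrative and not from the text, and restricting to three tosses is an assumption that suffices because the events involve only the first three outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space: outcomes of three fair coin tosses, each with probability 1/8.
OMEGA = list(product((0, 1), repeat=3))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(OMEGA))

def agree(u, v):
    """B_uv: the u-th and v-th tosses agree (u and v are 1-based)."""
    return frozenset(w for w in OMEGA if w[u - 1] == w[v - 1])

def independent(a, b):
    """True when P(A and B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

B12, B13, B23 = agree(1, 2), agree(1, 3), agree(2, 3)

# The three events are pairwise independent.
pairwise = all(independent(a, b) for a, b in [(B12, B13), (B12, B23), (B13, B23)])

# B12 & B13 lies in sigma(A_1) but is not independent of B23.
both = B12 & B13
pi_system_fails = not independent(both, B23)

# B23 is the complement of the symmetric difference of B12 and B13.
identity_holds = B23 == frozenset(OMEGA) - (B12 ^ B13)
```

Adding the intersection `both` to the first class would turn it into a $\pi$-system, and the independence hypothesis of Theorem 4.2 would then already fail at the level of the classes.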


Example 4.7. If $\mathcal{A} = \{A_1, A_2, \ldots\}$ is a finite or countable partition of $\Omega$
and $P(B \mid A_i) = p$ for each $A_i$ of positive probability, then $P(B) = p$ and $B$ is
independent of $\sigma(\mathcal{A})$: If $\sum'$ denotes summation over those $i$ for which
$P(A_i) > 0$, then $P(B) = \sum' P(A_i \cap B) = \sum' P(A_i)\,p = p$, and so $B$ is independent
of each set in the $\pi$-system $\mathcal{A} \cup \{\emptyset\}$. ■


Subfields 


Theorem 4.2 involves a number of $\sigma$-fields at once, which is characteristic of
probability theory; measure theory not directed toward probability usually
involves only one all-embracing $\sigma$-field $\mathcal{F}$. In probability, $\sigma$-fields in $\mathcal{F}$ (that
is, sub-$\sigma$-fields of $\mathcal{F}$) play an important role. To understand their function it
helps to have an informal, intuitive way of looking at them.

A subclass $\mathcal{A}$ of $\mathcal{F}$ corresponds heuristically to partial information.
Imagine that a point $\omega$ is drawn from $\Omega$ according to the probabilities given
by $P$: $\omega$ lies in $A$ with probability $P(A)$. Imagine also an observer who does
not know which $\omega$ it is that has been drawn but who does know for each $A$ in
$\mathcal{A}$ whether $\omega \in A$ or $\omega \notin A$; that is, who does not know $\omega$ but does know
the value of $I_A(\omega)$ for each $A$ in $\mathcal{A}$. Identifying this partial information with
the class $\mathcal{A}$ itself will illuminate the connection between various
measure-theoretic concepts and the premathematical ideas lying behind them.

The set $B$ is by definition independent of the class $\mathcal{A}$ if $P(B \mid A) = P(B)$
for all sets $A$ in $\mathcal{A}$ for which $P(A) > 0$. Thus if $B$ is independent of $\mathcal{A}$,
then the observer’s probability for $B$ is $P(B)$ even after he has received the
information in $\mathcal{A}$; in this case $\mathcal{A}$ contains no information about $B$. The
point of Theorem 4.2 is that this remains true even if the observer is given
the information in $\sigma(\mathcal{A})$, provided that $\mathcal{A}$ is a $\pi$-system. It is to be stressed
that here information, like observer and know, is an informal, extramathematical
term (in particular, it is not information in the technical sense of
entropy).

The notion of partial information can be looked at in terms of partitions.
Say that points $\omega$ and $\omega'$ are $\mathcal{A}$-equivalent if, for every $A$ in $\mathcal{A}$, $\omega$ and $\omega'$ lie
either both in $A$ or both in $A^c$; that is, if

(4.16)    $$I_A(\omega) = I_A(\omega'), \qquad A \in \mathcal{A}.$$

This relation partitions $\Omega$ into sets of equivalent points; call this the
$\mathcal{A}$-partition.


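Example 4.8. If $\omega$ and $\omega'$ are $\sigma(\mathcal{A})$-equivalent, then certainly they are
$\mathcal{A}$-equivalent. For fixed $\omega$ and $\omega'$, the class of $A$ such that $I_A(\omega) = I_A(\omega')$ is a
$\sigma$-field; if $\omega$ and $\omega'$ are $\mathcal{A}$-equivalent, then this $\sigma$-field contains $\mathcal{A}$ and hence
$\sigma(\mathcal{A})$, so that $\omega$ and $\omega'$ are also $\sigma(\mathcal{A})$-equivalent. Thus $\mathcal{A}$-equivalence
and $\sigma(\mathcal{A})$-equivalence are the same thing, and the $\mathcal{A}$-partition coincides
with the $\sigma(\mathcal{A})$-partition. ■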
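
On a finite space the claim of Example 4.8 can be verified mechanically. The sketch below is only an illustration under stated assumptions: the six-point space and the class `CLASS_A` are hypothetical, and `generated_field` builds the generated $\sigma$-field by naive closure under complements and pairwise unions, which is adequate only because the space is finite.

```python
from itertools import combinations

def partition(omega, sets):
    """Group points of omega by their indicator pattern over `sets`, as in (4.16)."""
    blocks = {}
    for w in omega:
        key = tuple(w in a for a in sets)
        blocks.setdefault(key, set()).add(w)
    return {frozenset(b) for b in blocks.values()}

def generated_field(omega, sets):
    """Smallest class containing `sets` and closed under complement and union."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(a) for a in sets}
    changed = True
    while changed:
        changed = False
        current = list(field)  # snapshot, since `field` grows during the pass
        for a in current:
            if omega - a not in field:
                field.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in field:
                field.add(a | b)
                changed = True
    return field

OMEGA = range(1, 7)              # hypothetical six-point space
CLASS_A = [{1, 2, 3}, {3, 4}]    # hypothetical generating class
SIGMA_A = generated_field(OMEGA, CLASS_A)
```

Here `partition(OMEGA, CLASS_A)` and `partition(OMEGA, list(SIGMA_A))` produce the same four blocks, even though `SIGMA_A` contains sixteen sets and `CLASS_A` only two.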


An observer with the information in $\sigma(\mathcal{A})$ knows, not the point $\omega$ drawn,
but only the equivalence class containing it. That is exactly the information
he has. In Example 4.6, it is as though an observer with the items
of information in $\mathcal{A}_1$ were unable to combine them to get information
about $B_{23}$.


Example 4.9. If $H_n = [\omega\colon d_n(\omega) = 0]$ as in Example 4.5, and if $\mathcal{A} =
\{H_1, H_3, H_5, \ldots\}$, then $\omega$ and $\omega'$ satisfy (4.16) if and only if $d_n(\omega) = d_n(\omega')$ for
all odd $n$. The information in $\sigma(\mathcal{A})$ is thus the set of values of $d_n(\omega)$ for $n$
odd. ■


One who knows that $\omega$ lies in a set $A$ has more information about $\omega$ the smaller
$A$ is. One who knows $I_A(\omega)$ for each $A$ in a class $\mathcal{A}$, however, has more information
about $\omega$ the larger $\mathcal{A}$ is. Furthermore, to have the information in $\mathcal{A}_1$ and the
information in $\mathcal{A}_2$ is to have the information in $\mathcal{A}_1 \cup \mathcal{A}_2$, not that in $\mathcal{A}_1 \cap \mathcal{A}_2$.

The following example points up the informal nature of this interpretation of
$\sigma$-fields as information.


Example 4.10. In the unit interval $(\Omega, \mathcal{F}, P)$ let $\mathcal{G}$ be the $\sigma$-field consisting of the
countable and the cocountable sets. Since $P(G)$ is 0 or 1 for each $G$ in $\mathcal{G}$, each set $H$
in $\mathcal{F}$ is independent of $\mathcal{G}$. But in this case the $\mathcal{G}$-partition consists of the singletons,
and so the information in $\mathcal{G}$ tells the observer exactly which $\omega$ in $\Omega$ has been drawn.
(i) The $\sigma$-field $\mathcal{G}$ contains no information about $H$, in the sense that $H$ and $\mathcal{G}$ are
independent. (ii) The $\sigma$-field $\mathcal{G}$ contains all the information about $H$, in the sense
that it tells the observer exactly which $\omega$ was drawn. ■


In this example, (i) and (ii) stand in apparent contradiction. But the mathematics is
in (i), where $H$ and $\mathcal{G}$ are independent, while (ii) only concerns heuristic interpretation.
The source of the difficulty or apparent paradox here lies in the unnatural structure of
the $\sigma$-field $\mathcal{G}$ rather than in any deficiency in the notion of independence.† The
heuristic equating of $\sigma$-fields and information is helpful even though it sometimes
breaks down, and of course proofs are indifferent to whatever illusions and vagaries
brought them into existence.

†See Problem 4.10 for a more extreme example.


The Borel-Cantelli Lemmas


This is the first Borel-Cantelli lemma:

Theorem 4.3. If $\sum_n P(A_n)$ converges, then $P(\limsup_n A_n) = 0$.


Proof. From $\limsup_n A_n \subset \bigcup_{k=m}^{\infty} A_k$ follows

    $$P(\limsup_n A_n) \le P\Bigl(\bigcup_{k=m}^{\infty} A_k\Bigr) \le \sum_{k=m}^{\infty} P(A_k),$$

and this sum tends to 0 as $m \to \infty$ if $\sum_n P(A_n)$ converges. ■


By Theorem 4.1, $P(A_n) \to 0$ implies that $P(\liminf_n A_n) = 0$; in Theorem
4.3 hypothesis and conclusion are both stronger.


Example 4.11. Consider the run length $l_n(\omega)$ of Example 4.1 and a
sequence $\{r_n\}$ of positive reals. If the series $\sum 1/2^{r_n}$ converges, then

(4.17)    $$P[\omega\colon l_n(\omega) \ge r_n \text{ i.o.}] = 0.$$

To prove this, note that if $s_n$ is $r_n$ rounded up to the next integer, then by
(4.7), $P[\omega\colon l_n(\omega) \ge r_n] = 2^{-s_n} \le 2^{-r_n}$. Therefore, (4.17) follows by the first
Borel-Cantelli lemma.

If $r_n = (1 + \epsilon)\log_2 n$ and $\epsilon$ is positive, there is convergence because
$2^{-r_n} = 1/n^{1+\epsilon}$. Thus

(4.18)    $$P[\omega\colon l_n(\omega) \ge (1 + \epsilon)\log_2 n \text{ i.o.}] = 0.$$

The limit superior of the ratio $l_n(\omega)/\log_2 n$ exceeds 1 if and only if $\omega$
belongs to the set in (4.18) for some positive rational $\epsilon$. Since the union of
this countable class of sets has probability 0,

(4.19)    $$P\Bigl[\omega\colon \limsup_n \frac{l_n(\omega)}{\log_2 n} > 1\Bigr] = 0.$$

To put it the other way around,

(4.20)    $$P\Bigl[\omega\colon \limsup_n \frac{l_n(\omega)}{\log_2 n} \le 1\Bigr] = 1.$$

Technically, the probability in (4.20) refers to Lebesgue measure. Intuitively,
it refers to an infinite sequence of independent tosses of a fair coin. ■
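
A finite simulation cannot establish an "infinitely often" statement, but it can make the scale $\log_2 n$ in (4.18) plausible. The sketch below computes run lengths for a pseudo-random sequence of tosses and reports the largest observed ratio $l_n/\log_2 n$ beyond a starting index. The function names, the seed, and the sample size are illustrative assumptions, and runs that reach the end of the finite sample are truncated there.

```python
import math
import random

def run_lengths(bits):
    """l_n for n = 1..len(bits): number of consecutive 0's starting at place n.

    Runs that reach the end of the finite list are truncated there.
    """
    lengths = [0] * len(bits)
    run = 0
    for i in range(len(bits) - 1, -1, -1):  # scan from the right
        run = run + 1 if bits[i] == 0 else 0
        lengths[i] = run
    return lengths

def max_ratio(bits, start):
    """Largest l_n / log2(n) over places n >= start (start must be at least 2)."""
    lengths = run_lengths(bits)
    return max(lengths[n - 1] / math.log2(n) for n in range(start, len(bits) + 1))

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
tosses = [rng.randint(0, 1) for _ in range(100_000)]
print(max_ratio(tosses, start=1000))
```

Varying the seed and the starting index gives a feel for how rarely the ratio moves far above 1; no single run should be read as evidence about the limit superior itself.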


In this example, whether $\limsup_n l_n(\omega)/\log_2 n \le 1$ holds or not is a
property of $\omega$, and the property in fact holds for $\omega$ in an $\mathcal{F}$-set of probability
1. In such a case the property is said to hold with probability 1, or almost
surely. In nonprobabilistic contexts, a property that holds for $\omega$ outside a
(measurable) set of measure 0 holds almost everywhere, or for almost all $\omega$.


Example 4.12. The preceding example has an interesting arithmetic consequence.
Truncating the dyadic expansion at $n$ gives the standard $(n - 1)$-place approximation
$\sum_{k=1}^{n-1} d_k(\omega) 2^{-k}$ to $\omega$; the error is between 0 and $2^{-n+1}$, and the error relative to the
maximum is

(4.21)    $$e_n(\omega) = \frac{\omega - \sum_{k=1}^{n-1} d_k(\omega) 2^{-k}}{2^{-n+1}} = \sum_{i=1}^{\infty} d_{n+i-1}(\omega) 2^{-i},$$

which lies between 0 and 1. The binary expansion of $e_n(\omega)$ begins with $l_n(\omega)$ 0's, and
then comes a 1. Hence $.0\ldots01 \le e_n(\omega) \le .0\ldots0111\ldots$, where there are $l_n(\omega)$ 0's in
the extreme terms. Therefore,

(4.22)    $$\frac{1}{2^{l_n(\omega)+1}} \le e_n(\omega) \le \frac{1}{2^{l_n(\omega)}},$$

so that results on run length give information about the error of approximation.

By the left-hand inequality in (4.22), $e_n(\omega) \le x_n$ (assume that $0 < x_n \le 1$) implies
that $l_n(\omega) \ge -\log_2 x_n - 1$; since $\sum 2^{-r_n} < \infty$ implies (4.17), $\sum x_n < \infty$ implies
$P[\omega\colon e_n(\omega) \le x_n \text{ i.o.}] = 0$. (Clearly, $[\omega\colon e_n(\omega) \le x]$ is a Borel set.) In particular,

(4.23)    $$P[\omega\colon e_n(\omega) \le 1/n^{1+\epsilon} \text{ i.o.}] = 0.$$

Technically, this probability refers to Lebesgue measure; intuitively, it refers to a
point drawn at random from the unit interval. ■
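
The two-sided bound (4.22) can be checked with exact rational arithmetic on a finite string of digits. This is a sketch under assumptions: the digit list is hypothetical, only finitely many digits are used (so $e_n$ is computed from a truncated expansion), and `run_length` requires a 1 at or after place $n$.

```python
from fractions import Fraction

def relative_error(bits, n):
    """e_n from (4.21) for a finite digit list: sum over i of d_{n+i-1} 2^{-i}."""
    return sum(Fraction(d, 2 ** i) for i, d in enumerate(bits[n - 1:], start=1))

def run_length(bits, n):
    """l_n: number of 0's from place n up to (not including) the next 1.

    Raises ValueError if no 1 occurs at or after place n.
    """
    return bits.index(1, n - 1) - (n - 1)

def check_4_22(bits, n):
    """Verify 2^{-(l_n+1)} <= e_n <= 2^{-l_n} exactly, using rationals."""
    l, e = run_length(bits, n), relative_error(bits, n)
    return Fraction(1, 2 ** (l + 1)) <= e <= Fraction(1, 2 ** l)

digits = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]  # hypothetical leading digits of omega
```

Calling `check_4_22(digits, n)` for each place `n` confirms the bound for this particular string; using `Fraction` avoids any floating-point doubt at the boundary cases.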


Example 4.13. The final step in the proof of the normal number theorem
(Theorem 1.2) was a disguised application of the first Borel-Cantelli lemma.
If $A_n = [\omega\colon |n^{-1} s_n(\omega)| \ge n^{-1/8}]$, then $\sum P(A_n) < \infty$, as follows by (1.29), and
so $P[A_n \text{ i.o.}] = 0$. But for $\omega$ in the set complementary to $[A_n \text{ i.o.}]$, $n^{-1} s_n(\omega) \to 0$.

The proof of Theorem 1.6 is also, in effect, an application of the first
Borel-Cantelli lemma. ■


This is the second Borel-Cantelli lemma:


Theorem 4.4. If $\{A_n\}$ is an independent sequence of events and $\sum_n P(A_n)$
diverges, then $P(\limsup_n A_n) = 1$.


Proof. It is enough to prove that $P\bigl(\bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c\bigr) = 0$ and hence
enough to prove that $P\bigl(\bigcap_{k=n}^{\infty} A_k^c\bigr) = 0$ for all $n$. Since $1 - x \le e^{-x}$,

    $$P\Bigl(\bigcap_{k=n}^{n+j} A_k^c\Bigr) = \prod_{k=n}^{n+j} \bigl(1 - P(A_k)\bigr) \le \exp\Bigl[-\sum_{k=n}^{n+j} P(A_k)\Bigr].$$

Since $\sum_k P(A_k)$ diverges, the last expression tends to 0 as $j \to \infty$, and hence
$P\bigl(\bigcap_{k=n}^{\infty} A_k^c\bigr) = \lim_j P\bigl(\bigcap_{k=n}^{n+j} A_k^c\bigr) = 0$. ■


By Theorem 4.1, $\limsup_n P(A_n) > 0$ implies $P(\limsup_n A_n) > 0$; in Theorem
4.4, the hypothesis $\sum_n P(A_n) = \infty$ is weaker but the conclusion is stronger
because of the additional hypothesis of independence.
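
The key estimate in the proof of Theorem 4.4 can be illustrated numerically. The sketch below assumes, purely for illustration, independent events with $P(A_k) = 1/(k+1)$, a choice with a divergent sum; the product then telescopes to $n/(n+j+1)$, which makes the comparison with the exponential bound easy to inspect. The function names are hypothetical.

```python
import math
from fractions import Fraction

def prob_none_occur(n, j):
    """P(no A_k occurs for n <= k <= n+j) for independent A_k with
    P(A_k) = 1/(k+1), a hypothetical choice whose probabilities sum to infinity."""
    result = Fraction(1)
    for k in range(n, n + j + 1):
        result *= 1 - Fraction(1, k + 1)
    return result

def exp_bound(n, j):
    """exp(-sum of P(A_k)) over the same range, the bound used in the proof."""
    return math.exp(-sum(1 / (k + 1) for k in range(n, n + j + 1)))

for j in (10, 100, 1000):
    print(j, float(prob_none_occur(1, j)), exp_bound(1, j))
```

Both columns shrink toward 0 as `j` grows, with the exact product staying below the bound, which is the mechanism that forces $P(\bigcap_{k \ge n} A_k^c) = 0$.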


Example 4.14. Since the events $[\omega\colon l_n(\omega) = 0] = [\omega\colon d_n(\omega) = 1]$, $n =
1, 2, \ldots$, are independent and have probability $\tfrac12$, $P[\omega\colon l_n(\omega) = 0 \text{ i.o.}] = 1$.
Since the events $A_n = [\omega\colon l_n(\omega) = 1] = [\omega\colon d_n(\omega) = 0,\ d_{n+1}(\omega) = 1]$, $n =
1, 2, \ldots$, are not independent, this argument is insufficient to prove that

(4.24)    $$P[\omega\colon l_n(\omega) = 1 \text{ i.o.}] = 1.$$

But the events $A_2, A_4, A_6, \ldots$ are independent (Theorem 4.2) and their
probabilities form a divergent series, and so $P[\omega\colon l_{2n}(\omega) = 1 \text{ i.o.}] = 1$, which
implies (4.24). ■


Significant applications of the second Borel-Cantelli lemma usually
require, in order to get around problems of dependence, some device of the
kind used in the preceding example.


Example 4.15. There is a complement to (4.17): If $r_n$ is nondecreasing and
$\sum 2^{-r_n}/r_n$ diverges, then

(4.25)    $$P[\omega\colon l_n(\omega) \ge r_n \text{ i.o.}] = 1.$$

To prove this, note first that if $r_n$ is rounded up to the next integer, then
$\sum 2^{-r_n}/r_n$ still diverges and (4.25) is unchanged. Assume then that $r_n = r(n)$ is
integral, and define $\{n_k\}$ inductively by $n_1 = 1$ and $n_{k+1} = n_k + r_{n_k}$, $k \ge 1$. Let
$A_k = [\omega\colon l_{n_k}(\omega) \ge r_{n_k}] = [\omega\colon d_i(\omega) = 0,\ n_k \le i < n_{k+1}]$; since the $A_k$ involve
nonoverlapping sequences of time indices, it follows by Corollary 2 to
Theorem 4.2 that $A_1, A_2, \ldots$ are independent. By the second Borel-Cantelli
lemma, $P[A_k \text{ i.o.}] = 1$ if $\sum_k P(A_k) = \sum_k 2^{-r(n_k)}$ diverges. But since $r_n$ is
nondecreasing,

    $$\sum_{k \ge 1} 2^{-r(n_k)} = \sum_{k \ge 1} 2^{-r(n_k)}\, r(n_k)^{-1} (n_{k+1} - n_k) \ge \sum_{k \ge 1} \sum_{n_k \le n < n_{k+1}} 2^{-r_n} r_n^{-1} = \sum_{n \ge 1} 2^{-r_n} r_n^{-1}.$$

Thus the divergence of $\sum_n 2^{-r_n} r_n^{-1}$ implies that of $\sum_k 2^{-r(n_k)}$, and it follows
that, with probability 1, $l_{n_k}(\omega) \ge r_{n_k}$ for infinitely many values of $k$. But this is
stronger than (4.25).

The result in Example 4.2 follows if $r_n \equiv r$, but this is trivial. If $r_n = \log_2 n$,
then $\sum 2^{-r_n}/r_n = \sum 1/(n \log_2 n)$ diverges, and therefore

(4.26)    $$P[\omega\colon l_n(\omega) \ge \log_2 n \text{ i.o.}] = 1.$$

By (4.26) and (4.20),

(4.27)    $$P\Bigl[\omega\colon \limsup_n \frac{l_n(\omega)}{\log_2 n} = 1\Bigr] = 1.$$

Thus for $\omega$ in a set of probability 1, $\log_2 n$ as a function of $n$ is a kind of
“upper envelope” for the function $l_n(\omega)$. ■
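
The blocking device in Example 4.15 can be traced concretely. The following sketch builds the indices $n_k$ for an integer-valued nondecreasing $r$ and checks, block by block and in exact arithmetic, the inequality used in the displayed chain. Two assumptions are made for illustration: $r(n)$ is $\log_2 n$ rounded up, and $r(1)$ is set to 1 (rather than 0) so that the recursion $n_{k+1} = n_k + r(n_k)$ advances. The function names are hypothetical.

```python
import math
from fractions import Fraction

def r(n):
    """log2(n) rounded up to an integer, with r(1) set to 1 so the blocks advance."""
    return max(1, math.ceil(math.log2(n)))

def block_starts(r, count):
    """First `count` indices n_1 = 1, n_{k+1} = n_k + r(n_k)."""
    starts = [1]
    while len(starts) < count:
        starts.append(starts[-1] + r(starts[-1]))
    return starts

def block_inequality_holds(starts):
    """Check 2^{-r(n_k)} >= sum over n_k <= n < n_{k+1} of 2^{-r(n)} / r(n)."""
    for lo, hi in zip(starts, starts[1:]):
        block_sum = sum(Fraction(1, 2 ** r(n) * r(n)) for n in range(lo, hi))
        if Fraction(1, 2 ** r(lo)) < block_sum:
            return False
    return True
```

Each block of indices $n_k \le i < n_{k+1}$ has length exactly $r(n_k)$, so the event $A_k$ says that one whole block consists of 0's; disjoint blocks are what make the $A_k$ independent.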


Example 4.16. By the right-hand inequality in (4.22), if $l_n(\omega) \ge \log_2 n$, then
$e_n(\omega) \le 1/n$. Hence (4.26) gives

(4.28)    $$P\Bigl[\omega\colon e_n(\omega) \le \frac{1}{n} \text{ i.o.}\Bigr] = 1.$$

This and (4.23) show that, with probability 1, $e_n(\omega)$ has $1/n$ as a “lower envelope.”
The discrepancy between $\omega$ and its $(n - 1)$-place approximation $\sum_{k=1}^{n-1} d_k(\omega) 2^{-k}$ will
fall infinitely often below $1/(n 2^{n-1})$ but not infinitely often below $1/(n^{1+\epsilon} 2^{n-1})$. ■


Example 4.17. Examples 4.12 and 4.16 have to do with the approximation of real
numbers by rationals: Diophantine approximation. Change the $x_n = 1/n^{1+\epsilon}$ of (4.23)
to $1/((n - 1)\log 2)^{1+\epsilon}$. Then $\sum x_n$ still converges, and hence

    $$P\bigl[\omega\colon e_n(\omega) \le 1/(\log 2^{n-1})^{1+\epsilon} \text{ i.o.}\bigr] = 0.$$

And by (4.28),

    $$P\bigl[\omega\colon e_n(\omega) < 1/\log 2^{n-1} \text{ i.o.}\bigr] = 1.$$

The dyadic rational $\sum_{k=1}^{n-1} d_k(\omega) 2^{-k} = p/q$ has denominator $q = 2^{n-1}$, and $e_n(\omega) =
q(\omega - p/q)$. There is therefore probability 1 that, if $q$ is restricted to the powers of 2,
then $0 \le \omega - p/q < 1/(q \log q)$ holds for infinitely many $p/q$ but $0 \le \omega - p/q \le
1/(q \log^{1+\epsilon} q)$ holds only for finitely many.† But contrast this with Theorems 1.5 and
1.6: The sharp rational approximations to a real number come not from truncating its
dyadic (or decimal) expansion, but from truncating its continued-fraction expansion;
see Section 24. ■

†This ignores the possibility of even $p$ (reducible $p/q$); but see Problem 1.11(b). And rounding
$\omega$ up to $(p + 1)/q$ instead of down to $p/q$ changes nothing; see Problem 4.13.


The Zero-One Law


For a sequence $A_1, A_2, \ldots$ of events in a probability space $(\Omega, \mathcal{F}, P)$
consider the $\sigma$-fields $\sigma(A_n, A_{n+1}, \ldots)$ and their intersection

(4.29)    $$\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma(A_n, A_{n+1}, \ldots).$$

This is the tail $\sigma$-field associated with the sequence $\{A_n\}$, and its elements are
called tail events. The idea is that a tail event is determined solely by the $A_n$
for arbitrarily large $n$.


Example 4.18. Since $\limsup_m A_m = \bigcap_{k \ge n} \bigcup_{i \ge k} A_i$ and $\liminf_m A_m =
\bigcup_{k \ge n} \bigcap_{i \ge k} A_i$ are both in $\sigma(A_n, A_{n+1}, \ldots)$, the limits superior and inferior
are tail events for the sequence $\{A_n\}$. ■


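Example 4.19. Let $l_n(\omega)$ be the run length, as before, and let $H_n = [\omega\colon
d_n(\omega) = 0]$. For each $n_0$,

    $$[\omega\colon l_n(\omega) \ge r_n \text{ i.o.}] = \bigcap_{n \ge n_0} \bigcup_{k \ge n} [\omega\colon l_k(\omega) \ge r_k] = \bigcap_{n \ge n_0} \bigcup_{k \ge n} H_k \cap H_{k+1} \cap \cdots \cap H_{k+r_k-1}.$$

Thus $[\omega\colon l_n(\omega) \ge r_n \text{ i.o.}]$ is a tail event for the sequence $\{H_n\}$. ■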


The probabilities of tail events are governed by Kolmogorov’s zero-one
law:†


Theorem 4.5. If $A_1, A_2, \ldots$ is an independent sequence of events, then for
each event $A$ in the tail $\sigma$-field (4.29), $P(A)$ is either 0 or 1.


Proof. By Corollary 2 to Theorem 4.2, $\sigma(A_1), \ldots, \sigma(A_{n-1})$,
$\sigma(A_n, A_{n+1}, \ldots)$ are independent. If $A \in \mathcal{T}$, then $A \in \sigma(A_n, A_{n+1}, \ldots)$ and
therefore $A_1, \ldots, A_{n-1}, A$ are independent. Since independence of a collection
of events is defined by independence of each finite subcollection, the
sequence $A, A_1, A_2, \ldots$ is independent. By a second application of Corollary
2 to Theorem 4.2, $\sigma(A)$ and $\sigma(A_1, A_2, \ldots)$ are independent. But $A \in \mathcal{T} \subset
\sigma(A_1, A_2, \ldots)$; from $A \in \sigma(A)$ and $A \in \sigma(A_1, A_2, \ldots)$ it follows that $A$ is
independent of itself: $P(A \cap A) = P(A)P(A)$. This is the same as $P(A) =
(P(A))^2$ and can hold only if $P(A)$ is 0 or 1. ■


Example 4.20. By the zero-one law and Example 4.18, $P(\limsup_n A_n)$ is
0 or 1 if the $A_n$ are independent. The Borel-Cantelli lemmas in this case go
further and give a specific criterion in terms of the convergence or divergence
of $\sum P(A_n)$. ■


Kolmogorov’s result is surprisingly general, and it is in many cases quite 
easy to use it to show that the probability of some set must have one of the 
extreme values 0 and 1. It is perhaps curious that it should so often be very 
difficult to determine which of these extreme values is the right one. 


†For a more general version, see Theorem 22.3.


Example 4.21. By Kolmogorov’s theorem and Example 4.19, $[\omega\colon l_n(\omega) \ge r_n
\text{ i.o.}]$ has probability 0 or 1. Call the sequence $\{r_n\}$ an outer boundary or an
inner boundary according as this probability is 0 or 1.

In Example 4.11 it was shown that $\{r_n\}$ is an outer boundary if $\sum 2^{-r_n} < \infty$.
In Example 4.15 it was shown that $\{r_n\}$ is an inner boundary if $r_n$ is
nondecreasing and $\sum 2^{-r_n} r_n^{-1} = \infty$. By these criteria $r_n = \theta \log_2 n$ gives an
outer boundary if $\theta > 1$ and an inner boundary if $\theta \le 1$.

What about the sequence $r_n = \log_2 n + \theta \log_2 \log_2 n$? Here $\sum 2^{-r_n} =
\sum 1/n(\log_2 n)^{\theta}$, and this converges for $\theta > 1$, which gives an outer boundary.
Now $2^{-r_n} r_n^{-1}$ is of the order $1/n(\log_2 n)^{1+\theta}$, and this diverges if $\theta \le 0$, which
gives an inner boundary (this follows indeed from (4.26)). But this analysis
leaves the range $0 < \theta \le 1$ unresolved, although every sequence is either an
inner or an outer boundary. This question is pursued further in Example 6.5.
■
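
The case analysis for $r_n = \log_2 n + \theta \log_2 \log_2 n$ can be summarized as a small decision function, together with a numerical check of the identity $2^{-r_n} = 1/(n (\log_2 n)^{\theta})$ on which it rests. This is only a restatement of the two criteria quoted in Example 4.21; the function names are hypothetical, and "unresolved" means unresolved by these two criteria, not undecidable.

```python
import math

def two_to_minus_r(n, theta):
    """2^{-r_n} for r_n = log2(n) + theta * log2(log2(n)); requires n >= 3."""
    r_n = math.log2(n) + theta * math.log2(math.log2(n))
    return 2.0 ** (-r_n)

def classify_log_plus_loglog(theta):
    """Classify r_n = log2(n) + theta*log2(log2(n)) using only the two criteria."""
    if theta > 1:
        return "outer"       # sum of 2^{-r_n} converges (Example 4.11)
    if theta <= 0:
        return "inner"       # sum of 2^{-r_n} / r_n diverges (Example 4.15)
    return "unresolved"      # 0 < theta <= 1: neither criterion applies
```

For instance, `classify_log_plus_loglog(0.5)` falls in the gap between the two criteria, which is the range taken up again in Example 6.5.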


PROBLEMS 


4.1. 2.1↑ The limits superior and inferior of a numerical sequence $\{x_n\}$ can be
defined as the supremum and infimum of the set of limit points, that is, the set
of limits of convergent subsequences. This is the same thing as defining

(4.30)    $$\limsup_n x_n = \bigwedge_{n=1}^{\infty} \bigvee_{k=n}^{\infty} x_k$$

and

(4.31)    $$\liminf_n x_n = \bigvee_{n=1}^{\infty} \bigwedge_{k=n}^{\infty} x_k.$$

Compare these relations with (4.4) and (4.5) and prove that

    $$I_{\limsup_n A_n} = \limsup_n I_{A_n}, \qquad I_{\liminf_n A_n} = \liminf_n I_{A_n}.$$

Prove that $\lim_n A_n$ exists in the sense of (4.6) if and only if $\lim_n I_{A_n}(\omega)$ exists for
each $\omega$.


4.2. ↑ (a) Prove that

    $$(\limsup_n A_n) \cap (\limsup_n B_n) \supset \limsup_n (A_n \cap B_n),$$
    $$(\limsup_n A_n) \cup (\limsup_n B_n) = \limsup_n (A_n \cup B_n),$$
    $$(\liminf_n A_n) \cap (\liminf_n B_n) = \liminf_n (A_n \cap B_n),$$
    $$(\liminf_n A_n) \cup (\liminf_n B_n) \subset \liminf_n (A_n \cup B_n).$$

Show by example that the two inclusions can be strict.
(b) The numerical analogue of the first of the relations in part (a) is

    $$(\limsup_n x_n) \wedge (\limsup_n y_n) \ge \limsup_n (x_n \wedge y_n).$$

Write out and verify the numerical analogues of the others.
(c) Show that

    $$\limsup_n A_n^c = (\liminf_n A_n)^c,$$
    $$\liminf_n A_n^c = (\limsup_n A_n)^c,$$
    $$\limsup_n A_n - \liminf_n A_n = \limsup_n (A_n \cap A_{n+1}^c) = \limsup_n (A_n^c \cap A_{n+1}).$$

(d) Show that $A_n \to A$ and $B_n \to B$ together imply that $A_n \cup B_n \to A \cup B$ and
$A_n \cap B_n \to A \cap B$.


4.3. Let $A_n$ be the square $[(x, y)\colon |x| \le 1,\ |y| \le 1]$ rotated through the angle $2\pi n\theta$.
Give geometric descriptions of $\limsup_n A_n$ and $\liminf_n A_n$ in case
(a) $\theta = \tfrac18$;
(b) $\theta$ is rational;
(c) $\theta$ is irrational. Hint: The $2\pi n\theta$ reduced modulo $2\pi$ are dense in $[0, 2\pi]$ if $\theta$
is irrational.
(d) When is there convergence in the sense of (4.6)?


4.4. Find a sequence for which all three inequalities in (4.9) are strict. 


4.5. (a) Show that $\lim_n P(\liminf_k A_n \cap A_k^c) = 0$. Hint: Show that $\limsup_n
\liminf_k A_n \cap A_k^c$ is empty.
Put $A^* = \limsup_n A_n$ and $A_* = \liminf_n A_n$.
(b) Show that $P(A_n - A^*) \to 0$ and $P(A_* - A_n) \to 0$.
(c) Show that $A_n \to A$ (in the sense that $A = A^* = A_*$) implies $P(A \,\triangle\, A_n) \to 0$.
(d) Suppose that $A_n$ converges to $A$ in the weaker sense that $P(A \,\triangle\, A^*) =
P(A \,\triangle\, A_*) = 0$ (which implies that $P(A^* - A_*) = 0$). Show that $P(A \,\triangle\, A_n) \to 0$
(which implies that $P(A_n) \to P(A)$).


4.6. In a space of six equally likely points (a die is rolled) find three events that are 
not independent even though each is independent of the intersection of the 
other two. 


4.7. For events $A_1, \ldots, A_n$, consider the $2^n$ equations $P(B_1 \cap \cdots \cap B_n) =
P(B_1) \cdots P(B_n)$ with $B_i = A_i$ or $B_i = A_i^c$ for each $i$. Show that $A_1, \ldots, A_n$ are
independent if all these equations hold.


4.8. For each of the following classes $\mathcal{A}$, describe the $\mathcal{A}$-partition defined by (4.16).
(a) The class of finite and cofinite sets.
(b) The class of countable and cocountable sets.
(c) A partition (of arbitrary cardinality) of $\Omega$.
(d) The level sets of $\sin x$ ($\Omega = R^1$).
(e) The $\sigma$-field in Problem 3.5.


4.9. 2.9 2.10↑ In connection with Example 4.8 and Problem 2.10, prove these
facts:
(a) Every set in $\sigma(\mathcal{A})$ is a union of $\mathcal{A}$-equivalence classes.
(b) If $\mathcal{A} = [A_\theta\colon \theta \in \Theta]$, then the $\mathcal{A}$-equivalence classes have the form $\bigcap_\theta B_\theta$,
where for each $\theta$, $B_\theta$ is $A_\theta$ or $A_\theta^c$.
(c) Every finite $\sigma$-field is generated by a finite partition of $\Omega$.
(d) If $\mathcal{F}_0$ is a field, then each singleton, even each finite set, in $\sigma(\mathcal{F}_0)$ is a
countable intersection of $\mathcal{F}_0$-sets.


4.10. 3.2↑ There is in the unit interval a set $H$ that is nonmeasurable in the
extreme sense that its inner and outer Lebesgue measures are 0 and 1 (see (3.9)
and (3.10)): $\lambda_*(H) = 0$ and $\lambda^*(H) = 1$. See Problem 12.4 for the construction.

Let $\Omega = (0, 1]$, let $\mathcal{G}$ consist of the Borel sets in $\Omega$, and let $H$ be the set just
described. Show that the class $\mathcal{F}$ of sets of the form $(H \cap G_1) \cup (H^c \cap G_2)$ for
$G_1$ and $G_2$ in $\mathcal{G}$ is a $\sigma$-field and that $P[(H \cap G_1) \cup (H^c \cap G_2)] = \tfrac12\lambda(G_1) +
\tfrac12\lambda(G_2)$ consistently defines a probability measure on $\mathcal{F}$. Show that $P(H) = \tfrac12$
and that $P(G) = \lambda(G)$ for $G \in \mathcal{G}$. Show that $\mathcal{G}$ is generated by a countable
subclass (see Problem 2.11). Show that $\mathcal{G}$ contains all the singletons and that $H$
and $\mathcal{G}$ are independent.

The construction proves this: There exist a probability space $(\Omega, \mathcal{F}, P)$, a
$\sigma$-field $\mathcal{G}$ in $\mathcal{F}$, and a set $H$ in $\mathcal{F}$, such that $P(H) = \tfrac12$, $H$ and $\mathcal{G}$ are
independent, and $\mathcal{G}$ is generated by a countable subclass and contains all the
singletons.

Example 4.10 is somewhat similar, but there the $\sigma$-field $\mathcal{G}$ is not countably
generated and each set in it has probability either 0 or 1. In the present example
$\mathcal{G}$ is countably generated and $P(G)$ assumes every value between 0 and 1 as $G$
ranges over $\mathcal{G}$. Example 4.10 is to some extent unnatural because the $\mathcal{G}$ there
is not countably generated. The present example, on the other hand, involves
the pathological set $H$. This example is used in Section 33 in connection with
conditional probability; see Problem 33.11.


4.11. (a) If $A_1, A_2, \ldots$ are independent events, then $P\bigl(\bigcap_{n=1}^{\infty} A_n\bigr) = \prod_{n=1}^{\infty} P(A_n)$ and
$P\bigl(\bigcup_{n=1}^{\infty} A_n\bigr) = 1 - \prod_{n=1}^{\infty} (1 - P(A_n))$. Prove these facts and from them derive
the second Borel-Cantelli lemma by the well-known relation between infinite
series and products.
(b) Show that $P(\limsup_n A_n) = 1$ if for each $k$ the series $\sum_{n > k} P(A_n \mid A_k^c \cap
\cdots \cap A_{n-1}^c)$ diverges. From this deduce the second Borel-Cantelli lemma once
again.
(c) Show by example that $P(\limsup_n A_n) = 1$ does not follow from the divergence
of $\sum_n P(A_n \mid A_1^c \cap \cdots \cap A_{n-1}^c)$ alone.
(d) Show that $P(\limsup_n A_n) = 1$ if and only if $\sum_n P(A \cap A_n)$ diverges for each
$A$ of positive probability.
(e) If sets $A_n$ are independent and $P(A_n) < 1$ for all $n$, then $P[A_n \text{ i.o.}] = 1$ if
and only if $P\bigl(\bigcup_n A_n\bigr) = 1$.


4.12. (a) Show (see Example 4.21) that $\log_2 n + \log_2 \log_2 n + \theta \log_2 \log_2 \log_2 n$ is an
outer boundary if $\theta > 1$. Generalize.
(b) Show that $\log_2 n + \log_2 \log_2 \log_2 n$ is an inner boundary.




4.13. Let $\varphi$ be a positive function of integers, and define $B_\varphi$ as the set of $x$ in (0, 1)
such that $|x - p/2^i| < 1/(2^i \varphi(2^i))$ holds for infinitely many pairs $p, i$. Adapting
the proof of Theorem 1.6, show directly (without reference to Example 4.12)
that $\sum_i 1/\varphi(2^i) < \infty$ implies $\lambda(B_\varphi) = 0$.


4.14. 2.19↑ Suppose that there are in $(\Omega, \mathcal{F}, P)$ independent events $A_1, A_2, \ldots$
such that, if $\alpha_n = \min\{P(A_n), 1 - P(A_n)\}$, then $\sum \alpha_n = \infty$. Show that $P$ is
nonatomic.


4.15. 2.18↑ Let $F$ be the set of square-free integers, those integers not divisible by
any perfect square. Let $F_l$ be the set of $m$ such that $p^2 \mid m$ for no $p \le l$, and
show that $D(F_l) = \prod_{p \le l} (1 - p^{-2})$. Show that $P_n(F_l - F) \le \sum_{p > l} p^{-2}$, and conclude
that the square-free integers have density $\prod_p (1 - p^{-2}) = 6/\pi^2$.


4.16. 2.18↑ Reconsider Problem 2.18(d). If $D$ were countably additive on $f(\mathcal{M})$, it
would extend to $\sigma(\mathcal{M})$. Use the second Borel-Cantelli lemma.


SECTION 5. SIMPLE RANDOM VARIABLES 
Definition 


Let $(\Omega, \mathcal{F}, P)$ be an arbitrary probability space, and let $X$ be a real-valued
function on $\Omega$; $X$ is a simple random variable if it has finite range (assumes
only finitely many values) and if

(5.1)    $$[\omega\colon X(\omega) = x] \in \mathcal{F}$$

for each real $x$. (Of course, $[\omega\colon X(\omega) = x] = \emptyset \in \mathcal{F}$ for $x$ outside the range
of $X$.) Whether or not $X$ satisfies this condition depends only on $\mathcal{F}$, not on
$P$, but the point of the definition is to ensure that the probabilities $P[\omega\colon
X(\omega) = x]$ are defined. Later sections will treat the theory of general random
variables, of functions on $\Omega$ having arbitrary range; (5.1) will require
modification in the general case.

The $d_n(\omega)$ of the preceding section (the digits of the dyadic expansion) are
simple random variables on the unit interval: the sets $[\omega\colon d_n(\omega) = 0]$ and $[\omega\colon
d_n(\omega) = 1]$ are finite unions of subintervals and hence lie in the $\sigma$-field $\mathcal{B}$ of
Borel sets in (0, 1]. The Rademacher functions are also simple random
variables. Although the concept itself is thus not entirely new, to proceed
further in probability requires a systematic theory of random variables and
their expected values.

The run lengths $l_n(\omega)$ satisfy (5.1) but are not simple random variables,
because they have infinite range (they come under the general theory). In a
discrete space, $\mathcal{F}$ consists of all subsets of $\Omega$, so that (5.1) always holds.

It is customary in probability theory to omit the argument $\omega$. Thus $X$
stands for a general value $X(\omega)$ of the function as well as for the function
itself, and $[X = x]$ is short for $[\omega\colon X(\omega) = x]$.

A finite sum

(5.2)    $$X = \sum_i x_i I_{A_i}$$

is a random variable if the $A_i$ form a finite partition of $\Omega$ into $\mathcal{F}$-sets.
Moreover, every simple random variable can be represented in the form
(5.2): for the $x_i$ take the range of $X$, and put $A_i = [X = x_i]$. But $X$ may have
other such representations because $x_i I_{A_i}$ can be replaced by $\sum_j x_i I_{A_{ij}}$ if the
$A_{ij}$ form a finite decomposition of $A_i$ into $\mathcal{F}$-sets.
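
The canonical representation (5.2) is easy to compute on a finite space. The following sketch assumes a hypothetical six-point space and a hypothetical random variable `X`; `canonical_representation` and `evaluate` are illustrative names, not from the text.

```python
def canonical_representation(omega, X):
    """Return {x_i: A_i} with A_i = [X = x_i], the representation (5.2)."""
    rep = {}
    for w in omega:
        rep.setdefault(X(w), set()).add(w)
    return rep

def evaluate(rep, w):
    """Compute sum_i x_i * I_{A_i}(w) from a representation {x_i: A_i}."""
    return sum(x * (w in a) for x, a in rep.items())

OMEGA = range(1, 7)          # hypothetical space: one roll of a die
X = lambda w: w % 3          # a simple random variable with range {0, 1, 2}
REP = canonical_representation(OMEGA, X)
```

Here the sets in `REP` partition `OMEGA`, and `evaluate(REP, w)` agrees with `X(w)` at every point, which is the content of (5.2) for this choice of $x_i$ and $A_i$.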

If $\mathcal{G}$ is a sub-$\sigma$-field of $\mathcal{F}$, a simple random variable $X$ is measurable $\mathcal{G}$,
or measurable with respect to $\mathcal{G}$, if $[X = x] \in \mathcal{G}$ for each $x$. A simple random
variable is by definition always measurable $\mathcal{F}$. Since $[X \in H] = \bigcup [X = x]$,
where the union extends over the finitely many $x$ lying both in $H$ and in the
range of $X$, $[X \in H] \in \mathcal{G}$ for every $H \subset R^1$ if $X$ is a simple random
variable measurable $\mathcal{G}$.

The $\sigma$-field $\sigma(X)$ generated by $X$ is the smallest $\sigma$-field with respect to
which $X$ is measurable; that is, $\sigma(X)$ is the intersection of all $\sigma$-fields with
respect to which $X$ is measurable. For a finite or infinite sequence $X_1, X_2, \ldots$
of simple random variables, $\sigma(X_1, X_2, \ldots)$ is the smallest $\sigma$-field with respect
to which each $X_i$ is measurable. It can be described explicitly in the finite
case:


Theorem 5.1. Let $X_1, \ldots, X_n$ be simple random variables.
(i) The $\sigma$-field $\sigma(X_1, \ldots, X_n)$ consists of the sets

(5.3)    $$[(X_1, \ldots, X_n) \in H] = [\omega\colon (X_1(\omega), \ldots, X_n(\omega)) \in H]$$

for $H \subset R^n$; $H$ in this representation may be taken finite.
(ii) A simple random variable $Y$ is measurable $\sigma(X_1, \ldots, X_n)$ if and only if

(5.4)    $$Y = f(X_1, \ldots, X_n)$$

for some $f\colon R^n \to R^1$.


Proof. Let $\mathcal{M}$ be the class of sets of the form (5.3). Sets of the form
$[(X_1, \ldots, X_n) = (x_1, \ldots, x_n)] = \bigcap_{i=1}^{n} [X_i = x_i]$ must lie in $\sigma(X_1, \ldots, X_n)$; each
set (5.3) is a finite union of sets of this form because $(X_1, \ldots, X_n)$, as a
mapping from $\Omega$ to $R^n$, has finite range. Thus $\mathcal{M} \subset \sigma(X_1, \ldots, X_n)$.

On the other hand, $\mathcal{M}$ is a $\sigma$-field because $\Omega = [(X_1, \ldots, X_n) \in R^n]$,
$[(X_1, \ldots, X_n) \in H]^c = [(X_1, \ldots, X_n) \in H^c]$, and $\bigcup_j [(X_1, \ldots, X_n) \in H_j] =
[(X_1, \ldots, X_n) \in \bigcup_j H_j]$. But each $X_i$ is measurable with respect to $\mathcal{M}$,
because $[X_i = x]$ can be put in the form (5.3) by taking $H$ to consist of those
$(x_1, \ldots, x_n)$ in $R^n$ for which $x_i = x$. It follows that $\sigma(X_1, \ldots, X_n)$ is contained
in $\mathcal{M}$ and therefore equals $\mathcal{M}$. As intersecting $H$ with the range (finite)
of $(X_1, \ldots, X_n)$ in $R^n$ does not affect (5.3), $H$ may be taken finite. This
proves (i).

Assume that $Y$ has the form (5.4); that is, $Y(\omega) = f(X_1(\omega), \ldots, X_n(\omega))$
for every $\omega$. Since $[Y = y]$ can be put in the form (5.3) by taking $H$ to consist
of those $x = (x_1, \ldots, x_n)$ for which $f(x) = y$, it follows that $Y$ is measurable
$\sigma(X_1, \ldots, X_n)$.

Now assume that $Y$ is measurable $\sigma(X_1, \ldots, X_n)$. Let $y_1, \ldots, y_r$ be the
distinct values $Y$ assumes. By part (i), there exist sets $H_1, \ldots, H_r$ in $R^n$ such
that

    $$[\omega\colon Y(\omega) = y_i] = [\omega\colon (X_1(\omega), \ldots, X_n(\omega)) \in H_i].$$

Take $f = \sum_{i=1}^{r} y_i I_{H_i}$. Although the $H_i$ need not be disjoint, if $H_i$ and $H_j$
share a point of the form $(X_1(\omega), \ldots, X_n(\omega))$, then $Y(\omega) = y_i$ and $Y(\omega) = y_j$,
which is impossible if $i \ne j$. Therefore each $(X_1(\omega), \ldots, X_n(\omega))$ lies in exactly
one of the $H_i$, and it follows that $f(X_1(\omega), \ldots, X_n(\omega)) = Y(\omega)$. ■


Since (5.4) implies that $Y$ is measurable $\sigma(X_1, \ldots, X_n)$, it follows in
particular that functions of simple random variables are again simple random
variables. Thus $X^2$, $e^{tX}$, and so on are simple random variables along with $X$.
Taking $f$ to be $\sum_{i=1}^{n} x_i$, $\prod_{i=1}^{n} x_i$, or $\max_{i \le n} x_i$ shows that sums, products, and
maxima of simple random variables are simple random variables.

As explained in Section 4, a sub-$\sigma$-field corresponds to partial information
about $\omega$. From this point of view, $\sigma(X_1, \ldots, X_n)$ corresponds to a knowledge
of the values $X_1(\omega), \ldots, X_n(\omega)$. These values suffice to determine the value
$Y(\omega)$ if and only if (5.4) holds. The elements of the $\sigma(X_1, \ldots, X_n)$-partition
(see (4.16)) are the sets $[X_1 = x_1, \ldots, X_n = x_n]$ for $x_i$ in the range of $X_i$.


Example 5.1. For the dyadic digits $d_n(\omega)$ on the unit interval, $d_3$ is not
measurable $\sigma(d_1, d_2)$; indeed, there exist $\omega'$ and $\omega''$ such that $d_1(\omega') = d_1(\omega'')$
and $d_2(\omega') = d_2(\omega'')$ but $d_3(\omega') \ne d_3(\omega'')$, an impossibility if $d_3(\omega) =
f(d_1(\omega), d_2(\omega))$ identically in $\omega$. If such an $f$ existed, one could unerringly
predict the outcome $d_3(\omega)$ of the third toss from the outcomes $d_1(\omega)$ and
$d_2(\omega)$ of the first two. ■
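
Theorem 5.1(ii) suggests a direct test on a finite space: try to tabulate $f$ on the observed values of $(X_1, \ldots, X_n)$ and see whether any two points with the same $X$-values disagree in $Y$. The sketch below does this; the eight-point space of digit patterns is a hypothetical finite stand-in for the unit interval, and `factor_through` is an illustrative name.

```python
def factor_through(omega, Y, Xs):
    """Return a dict f with Y(w) = f[(X_1(w), ..., X_n(w))] for all w,
    or None when Y is not measurable sigma(X_1, ..., X_n)."""
    f = {}
    for w in omega:
        key = tuple(X(w) for X in Xs)
        if f.setdefault(key, Y(w)) != Y(w):
            return None  # two points share their X-values but differ in Y
    return f

# Hypothetical finite stand-in for the unit interval: the 8 patterns of three digits.
OMEGA = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
d1, d2, d3 = (lambda w: w[0]), (lambda w: w[1]), (lambda w: w[2])
```

With these definitions `factor_through(OMEGA, d3, [d1, d2])` finds no function, matching Example 5.1, while the sum of the first two digits does factor through `d1` and `d2`.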


Example 5.2. Let $s_n(\omega) = \sum_{k=1}^{n} r_k(\omega)$ be the partial sums of the
Rademacher functions; see (1.14). By Theorem 5.1(ii) $s_k$ is measurable
$\sigma(r_1, \ldots, r_n)$ for $k \le n$, and $r_k = s_k - s_{k-1}$ is measurable $\sigma(s_1, \ldots, s_n)$ for
$k \le n$. Thus $\sigma(r_1, \ldots, r_n) = \sigma(s_1, \ldots, s_n)$. In random-walk terms, the first $n$
positions contain the same information as the first $n$ distances moved. In
gambling terms, to know the gambler’s first $n$ fortunes (relative to his initial
fortune) is the same thing as to know his gains and losses on each of the first
$n$ plays. ■
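
The two maps behind Example 5.2 are inverse to each other, which is why the two $\sigma$-fields coincide. A minimal sketch, with a hypothetical sequence of steps and illustrative function names:

```python
from itertools import accumulate

def partial_sums(r):
    """s_k = r_1 + ... + r_k for k = 1, ..., n."""
    return list(accumulate(r))

def increments(s):
    """r_k = s_k - s_{k-1}, with s_0 = 0."""
    return [b - a for a, b in zip([0] + s[:-1], s)]

steps = [1, -1, -1, 1, 1, 1]  # hypothetical values of r_1, ..., r_6 (each +1 or -1)
```

Applying `increments` to `partial_sums(steps)` returns `steps`, so each list is a function of the other in the sense of (5.4).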
Example 5.3. An indicator $I_A$ is measurable $\mathcal{G}$ if and only if $A$ lies in $\mathcal{G}$.
And $A \in \sigma(A_1, \ldots, A_n)$ if and only if $I_A = f(I_{A_1}, \ldots, I_{A_n})$ for some $f\colon R^n \to R^1$.
■


Convergence of Random Variables 


It is a basic problem, for given random variables $X$ and $X_1, X_2, \ldots$ on a
probability space $(\Omega, \mathcal{F}, P)$, to look for the probability of the event that
$\lim_n X_n(\omega) = X(\omega)$. The normal number theorem is an example, one where
the probability is 1. It is convenient to characterize the complementary event:
$X_n(\omega)$ fails to converge to $X(\omega)$ if and only if there is some $\epsilon$ such that for no
$m$ does $|X_n(\omega) - X(\omega)|$ remain below $\epsilon$ for all $n$ exceeding $m$; that is to say,
if and only if, for some $\epsilon$, $|X_n(\omega) - X(\omega)| \ge \epsilon$ holds for infinitely many values
of $n$. Therefore,

(5.5)    $$[\lim_n X_n = X]^c = \bigcup_{\epsilon} \bigl[\,|X_n - X| \ge \epsilon \text{ i.o.}\bigr],$$

where the union can be restricted to rational (positive) $\epsilon$ because the set in
the union increases as $\epsilon$ decreases (compare (2.2)).

The event $[\lim_n X_n = X]$ therefore always lies in the basic $\sigma$-field $\mathcal{F}$, and
it has probability 1 if and only if


(5.6)    $P\bigl[\,|X_n - X| \ge \varepsilon \ \text{i.o.}\bigr] = 0$


for each $\varepsilon$ (rational or not). The event in (5.6) is the limit superior of the
events $[\,|X_n - X| \ge \varepsilon]$, and it follows by Theorem 4.1 that (5.6) implies


(5.7)    $\lim_n P\bigl[\,|X_n - X| \ge \varepsilon\bigr] = 0.$


This leads to a definition: If (5.7) holds for each positive $\varepsilon$, then $X_n$ is said to
converge to $X$ in probability, written $X_n \to_P X$.
These arguments prove two facts: 


Theorem 5.2. (i) There is convergence $\lim_n X_n = X$ with probability 1 if
and only if (5.6) holds for each $\varepsilon$.


(ii) Convergence with probability 1 implies convergence in probability. 


Theorem 1.2, the normal number theorem, has to do with the convergence
with probability 1 of $n^{-1}\sum_{i=1}^n d_i(\omega)$ to $\tfrac12$. Theorem 1.1 has to do instead with
the convergence in probability of the same sequence. By Theorem 5.2(ii),
then, Theorem 1.1 is a consequence of Theorem 1.2 (see (1.30) and (1.31)).
The converse is not true, however: convergence in probability does not
imply convergence with probability 1.


Example 5.4. Take $X \equiv 0$ and $X_n = I_{A_n}$. Then $X_n \to_P X$ is equivalent to
$P(A_n) \to 0$, and $[\lim_n X_n = X]^c = [A_n \ \text{i.o.}]$. Any sequence $\{A_n\}$ such that
$P(A_n) \to 0$ but $P[A_n \ \text{i.o.}] > 0$ therefore gives a counterexample to the
converse to Theorem 5.2(ii).

Consider the event $A_n = [\omega\colon l_n(\omega) \ge \log_2 n]$ in Example 4.15. Here,
$P(A_n) \le 1/n \to 0$, while $P[A_n \ \text{i.o.}] = 1$ by (4.26), and so this is one
counterexample. For an example more extreme and more transparent, define
events in the unit interval in the following way. Define the first two by


$A_1 = (0, \tfrac12], \qquad A_2 = (\tfrac12, 1].$

Define the next four by

$A_3 = (0, \tfrac14], \qquad A_4 = (\tfrac14, \tfrac12], \qquad A_5 = (\tfrac12, \tfrac34], \qquad A_6 = (\tfrac34, 1].$


Define the next eight, $A_7,\dots,A_{14}$, as the dyadic intervals of rank 3. And so
on. Certainly, $P(A_n) \to 0$, and since each point $\omega$ is covered by one set in
each successive block of length $2^k$, the set $[A_n \ \text{i.o.}]$ is all of $(0, 1]$. ■
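
The second counterexample of Example 5.4 can be checked mechanically. The following is only an illustrative sketch: the helper names `typewriter_interval` and `indicator` and the test point $\omega = 1/3$ are assumptions made here, not part of the text. It lists the indices $n \le 62$ (five complete blocks) at which $X_n(\omega) = 1$, and the lengths $P(A_n)$ at the start of each block.

```python
from fractions import Fraction


def typewriter_interval(n):
    """Return the endpoints (a, b) of A_n = (a, b] for n >= 1.

    Block k (k = 1, 2, ...) holds the 2**k dyadic intervals of rank k.
    """
    k, first = 1, 1          # block k starts at index `first`
    while n >= first + 2**k:
        first += 2**k
        k += 1
    j = n - first            # position of A_n inside block k
    return Fraction(j, 2**k), Fraction(j + 1, 2**k)


def indicator(n, w):
    """Value of X_n(w) = I_{A_n}(w)."""
    a, b = typewriter_interval(n)
    return 1 if a < w <= b else 0


w = Fraction(1, 3)
# Indices n <= 62 with X_n(w) = 1: exactly one per block, so X_n(w) = 1
# keeps recurring even though P(A_n) shrinks to 0.
hits = [n for n in range(1, 63) if indicator(n, w)]
# P(A_n) for the first interval of each of the five blocks.
lengths = [b - a for a, b in map(typewriter_interval, (1, 3, 7, 15, 31))]
print(hits, lengths)
```

Each block contributes one hit, which is the sense in which $[A_n \ \text{i.o.}]$ is all of $(0,1]$ while the lengths halve from block to block.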


Independence


A sequence $X_1, X_2, \dots$ (finite or infinite) of simple random variables is by
definition independent if the classes $\sigma(X_1), \sigma(X_2), \dots$ are independent in the
sense of the preceding section. By Theorem 5.1(i), $\sigma(X_i)$ consists of the sets
$[X_i \in H]$ for $H \subset R^1$. The condition for independence of $X_1,\dots,X_n$ is
therefore that


(5.8)    $P[X_1 \in H_1, \dots, X_n \in H_n] = P[X_1 \in H_1] \cdots P[X_n \in H_n]$


for linear sets $H_1,\dots,H_n$. The definition (4.10) also requires that (5.8) hold if
one or more of the $[X_i \in H_i]$ is suppressed; but taking $H_i$ to be $R^1$ eliminates
it from each side. For an infinite sequence $X_1, X_2, \dots$, (5.8) must hold for
each $n$. A special case of (5.8) is


(5.9)    $P[X_1 = x_1, \dots, X_n = x_n] = P[X_1 = x_1] \cdots P[X_n = x_n].$


On the other hand, summing (5.9) over $x_1 \in H_1, \dots, x_n \in H_n$ gives (5.8). Thus
the $X_i$ are independent if and only if (5.9) holds for all $x_1,\dots,x_n$.
Suppose that


(5.10)    $\begin{matrix} X_{11} & X_{12} & \cdots \\ X_{21} & X_{22} & \cdots \\ \vdots & \vdots & \end{matrix}$


is an independent array of simple random variables. There may be finitely or
infinitely many rows, each row finite or infinite. If $\mathcal{A}_i$ consists of the finite
intersections $\bigcap_j [X_{ij} \in H_j]$ with $H_j \subset R^1$, an application of Theorem 4.2
shows that the $\sigma$-fields $\sigma(X_{i1}, X_{i2}, \dots)$, $i = 1, 2, \dots$, are independent. As a
consequence, $Y_1, Y_2, \dots$ are independent if $Y_i$ is measurable $\sigma(X_{i1}, X_{i2}, \dots)$
for each $i$.


Example 5.5. The dyadic digits $d_1(\omega), d_2(\omega), \dots$ on the unit interval are
an independent sequence of random variables for which


(5.11)    $P[d_n = 0] = P[d_n = 1] = \tfrac12.$


It is because of (5.11) and independence that the $d_n$ give a model for tossing
a fair coin.

The sequence $(d_1(\omega), d_2(\omega), \dots)$ and the point $\omega$ determine one another.
It can be imagined that $\omega$ is determined by the outcomes $d_n(\omega)$ of a
sequence of tosses. It can also be imagined that $\omega$ is the result of drawing a
point at random from the unit interval, and that $\omega$ determines the $d_n(\omega)$. In
the second interpretation the $d_n(\omega)$ are all determined the instant $\omega$ is
drawn, and so it should further be imagined that they are then revealed to
the coin tosser or gambler one by one. For example, $\sigma(d_1, d_2)$ corresponds to
knowing the outcomes of the first two tosses (to knowing not $\omega$ but only
$d_1(\omega)$ and $d_2(\omega)$), and this does not help in predicting the value $d_3(\omega)$,
because $\sigma(d_1, d_2)$ and $\sigma(d_3)$ are independent. See Example 5.1. ■
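
As a small check of Example 5.5, the sketch below verifies (5.9) and (5.11) for the first three dyadic digits with exact arithmetic. It rests on one assumption stated here rather than in the text: since $d_1, d_2, d_3$ are constant on each dyadic interval of rank 3, it is enough to evaluate them at one representative point (the midpoint) of each such interval and weight it by the interval's length. The names `points`, `d`, and `prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

N = 3  # number of digits examined
# Midpoint of each dyadic interval of rank N; each interval has Lebesgue
# measure 2**-N, and d_1, ..., d_N are constant on it.
points = [Fraction(2 * j + 1, 2 ** (N + 1)) for j in range(2 ** N)]
mass = Fraction(1, 2 ** N)


def d(i, w):
    """i-th dyadic digit of w (w is never a dyadic rational of rank <= N here)."""
    return int(w * 2 ** i) % 2


def prob(event):
    """Lebesgue measure of an event determined by d_1, ..., d_N."""
    return sum((mass for w in points if event(w)), Fraction(0))


# (5.9): every joint probability equals the product of the marginals in (5.11).
independent = all(
    prob(lambda w: all(d(i + 1, w) == u[i] for i in range(N)))
    == Fraction(1, 2) ** N
    for u in product((0, 1), repeat=N)
)
marginals_fair = all(
    prob(lambda w: d(i, w) == v) == Fraction(1, 2)
    for i in range(1, N + 1) for v in (0, 1)
)
# Example 5.1: fixing d_1 and d_2 still leaves both values of d_3 possible.
undetermined = all(
    {d(3, w) for w in points if (d(1, w), d(2, w)) == u} == {0, 1}
    for u in product((0, 1), repeat=2)
)
print(independent, marginals_fair, undetermined)
```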


Example 5.6. Every permutation can be written as a product of cycles. 
For example, 


$\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 5 & 1 & 7 & 4 & 6 & 2 & 3 \end{pmatrix} = (1\,5\,6\,2)(3\,7)(4).$


This permutation sends 1 to 5, 2 to 1, 3 to 7, and so on. The cyclic form on 
the right shows that 1 goes to 5, which goes to 6, which goes to 2, which goes 
back to 1; and so on. To standardize this cyclic representation, start the first 
cycle with 1 and each successive cycle with the smallest integer not yet 
encountered. 

Let $\Omega$ consist of the $n!$ permutations of $1, 2, \dots, n$, all equally probable; $\mathcal{F}$
contains all subsets of $\Omega$, and $P(A)$ is the fraction of points in $A$. Let $X_k(\omega)$
be 1 or 0 according as the element in the $k$th position in the cyclic
representation of the permutation $\omega$ completes a cycle or not. Then
$S(\omega) = \sum_{k=1}^n X_k(\omega)$ is the number of cycles in $\omega$. In the example above, $n = 7$,
$X_1 = X_2 = X_3 = X_5 = 0$, $X_4 = X_6 = X_7 = 1$, and $S = 3$. The following argument
shows that $X_1,\dots,X_n$ are independent and $P[X_k = 1] = 1/(n-k+1)$. This
will lead later on to results on $P[S \in H]$.

The idea is this: $X_1(\omega) = 1$ if and only if the random permutation $\omega$ sends
1 to itself, the probability of which is $1/n$. If it happens that $X_1(\omega) = 1$ (that
$\omega$ fixes 1), then the image of 2 is one of $2,\dots,n$, and $X_2(\omega) = 1$ if and only if
this image is in fact 2; the conditional probability of this is $1/(n-1)$. If
$X_1(\omega) = 0$, on the other hand, then $\omega$ sends 1 to some $i \ne 1$, so that the image
of $i$ is one of $1,\dots,i-1,i+1,\dots,n$, and $X_2(\omega) = 1$ if and only if this image
is in fact 1; the conditional probability of this is again $1/(n-1)$. This
argument generalizes.


But the details are fussy. Let $Y_1(\omega),\dots,Y_n(\omega)$ be the integers in the successive
positions in the cyclic representation of $\omega$. Fix $k$, and let $A_v$ be the set where
$(X_1,\dots,X_{k-1}, Y_1,\dots,Y_k)$ assumes a specific vector of values
$v = (x_1,\dots,x_{k-1}, y_1,\dots,y_k)$. The $A_v$ form a partition $\mathcal{A}$ of $\Omega$, and if
$P[X_k = 1 \mid A_v] = 1/(n-k+1)$ for each $v$, then by Example 4.7, $P[X_k = 1] = 1/(n-k+1)$
and $X_k$ is independent of $\sigma(\mathcal{A})$ and hence of the smaller $\sigma$-field
$\sigma(X_1,\dots,X_{k-1})$. It will follow by induction that $X_1,\dots,X_n$ are independent.

Let $j$ be the position of the rightmost 1 among $x_1,\dots,x_{k-1}$ ($j = 0$ if there are
none). Then $\omega$ lies in $A_v$ if and only if it permutes $y_1,\dots,y_j$ among themselves (in a
way specified by the values $x_1,\dots,x_{j-1}$, $x_j = 1$, $y_1,\dots,y_j$) and sends each of
$y_{j+1},\dots,y_{k-1}$ to the $y$ just to its right. Thus $A_v$ contains $(n-k+1)!$ sample points.
And $X_k(\omega) = 1$ if and only if $\omega$ also sends $y_k$ to $y_{j+1}$. Thus $A_v \cap [X_k = 1]$ contains
$(n-k)!$ sample points, and so the conditional probability of $X_k = 1$ is $1/(n-k+1)$. ■
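
The claims of Example 5.6 can be confirmed by brute force for a small $n$. The following sketch is an illustration under assumptions made here (the helper name `cycle_indicators`, the choice $n = 5$, and permutations stored as tuples with `perm[i-1]` the image of `i`); it builds the standardized cyclic representation, reads off $(X_1,\dots,X_n)$, and compares every joint probability with the product of the claimed marginals.

```python
from fractions import Fraction
from itertools import permutations, product


def cycle_indicators(perm):
    """Return (X_1, ..., X_n) for the permutation i -> perm[i-1] of 1..n.

    The cyclic representation starts the first cycle with 1 and each later
    cycle with the smallest integer not yet encountered.
    """
    seen, xs = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        cur = start
        while True:
            seen.add(cur)
            nxt = perm[cur - 1]
            xs.append(1 if nxt == start else 0)  # 1 when this position closes a cycle
            if nxt == start:
                break
            cur = nxt
    return tuple(xs)


n = 5
counts = {}
for perm in permutations(range(1, n + 1)):
    x = cycle_indicators(perm)
    counts[x] = counts.get(x, 0) + 1
total = sum(counts.values())  # n! sample points, all equally probable


def marginal(k, v):
    """Claimed P[X_k = v], with P[X_k = 1] = 1/(n - k + 1)."""
    p = Fraction(1, n - k + 1)
    return p if v == 1 else 1 - p


ok = True
for x in product((0, 1), repeat=n):
    joint = Fraction(counts.get(x, 0), total)
    prod = Fraction(1)
    for k, v in enumerate(x, start=1):
        prod *= marginal(k, v)
    ok = ok and joint == prod
print(cycle_indicators((5, 1, 7, 4, 6, 2, 3)), ok)
```

The first printed value reproduces the indicators of the seven-element permutation displayed in the example.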


Existence of Independent Sequences 


The distribution of a simple random variable $X$ is the probability measure $\mu$
defined for all subsets $A$ of the line by


(5.12)    $\mu(A) = P[X \in A].$


This does define a probability measure. It is discrete in the sense of Example
2.9: If $x_1,\dots,x_l$ are the distinct points of the range of $X$, then $\mu$ has mass
$p_i = P[X = x_i] = \mu\{x_i\}$ at $x_i$, and $\mu(A) = \sum p_i$, the sum extending over those $i$
for which $x_i \in A$. As $\mu(A) = 1$ if $A$ is the range of $X$, not only is $\mu$ discrete,
it has finite support.


Theorem 5.3. Let $\{\mu_n\}$ be a sequence of probability measures on the class
of all subsets of the line, each having finite support. There exists on some
probability space $(\Omega, \mathcal{F}, P)$ an independent sequence $\{X_n\}$ of simple random
variables such that $X_n$ has distribution $\mu_n$.


What matters here is that there are finitely or countably many distributions
$\mu_n$. They need not be indexed by the integers; any countable index set
will do.

PROOF. The probability space will be the unit interval. To understand
the construction, consider first the case in which each $\mu_n$ concentrates its
mass on the two points 0 and 1. Put $p_n = \mu_n\{0\}$ and $q_n = 1 - p_n = \mu_n\{1\}$. Split
$(0, 1]$ into two intervals $I_0$ and $I_1$ of lengths $p_1$ and $q_1$. Define $X_1(\omega) = 0$ for
$\omega \in I_0$ and $X_1(\omega) = 1$ for $\omega \in I_1$. If $P$ is Lebesgue measure, then clearly
$P[X_1 = 0] = p_1$ and $P[X_1 = 1] = q_1$, so that $X_1$ has distribution $\mu_1$.


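[Figure: the interval $(0,1]$ split into $I_0$ (where $X_1 = 0$, length $p_1$) and $I_1$ (where $X_1 = 1$, length $q_1$), and then into four subintervals of lengths $p_1p_2$, $p_1q_2$, $q_1p_2$, $q_1q_2$.]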


Now split $I_0$ into two intervals $I_{00}$ and $I_{01}$ of lengths $p_1p_2$ and $p_1q_2$, and
split $I_1$ into two intervals $I_{10}$ and $I_{11}$ of lengths $q_1p_2$ and $q_1q_2$. Define
$X_2(\omega) = 0$ for $\omega \in I_{00} \cup I_{10}$ and $X_2(\omega) = 1$ for $\omega \in I_{01} \cup I_{11}$. As the diagram
makes clear, $P[X_1 = 0, X_2 = 0] = p_1p_2$, and similarly for the other three
possibilities. It follows that $X_1$ and $X_2$ are independent and $X_2$ has
distribution $\mu_2$. Now $X_3$ is constructed by splitting each of $I_{00}, I_{01}, I_{10}, I_{11}$ in
the proportions $p_3$ and $q_3$. And so on.

If $p_n = q_n = \tfrac12$ for all $n$, then the successive decompositions here are the
decompositions of $(0, 1]$ into dyadic intervals, and $X_n(\omega) = d_n(\omega)$.

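The argument for the general case is not very different. Let $x_{n1},\dots,x_{nl_n}$
be the distinct points on which $\mu_n$ concentrates its mass, and put $p_{ni} = \mu_n\{x_{ni}\}$
for $1 \le i \le l_n$.

Decompose† $(0, 1]$ into $l_1$ subintervals $I_1^{(1)},\dots,I_{l_1}^{(1)}$ of respective lengths
$p_{11},\dots,p_{1l_1}$. Define $X_1$ by setting $X_1(\omega) = x_{1i}$ for $\omega \in I_i^{(1)}$, $1 \le i \le l_1$. Then ($P$
is Lebesgue measure) $P[\omega\colon X_1(\omega) = x_{1i}] = P(I_i^{(1)}) = p_{1i}$, $1 \le i \le l_1$. Thus $X_1$ is
a simple random variable with distribution $\mu_1$.

Next decompose each $I_i^{(1)}$ into $l_2$ subintervals $I_{i1}^{(2)},\dots,I_{il_2}^{(2)}$ of respective
lengths $p_{1i}p_{21},\dots,p_{1i}p_{2l_2}$. Define $X_2(\omega) = x_{2j}$ for $\omega \in \bigcup_{i=1}^{l_1} I_{ij}^{(2)}$, $1 \le j \le l_2$.
Then $P[\omega\colon X_1(\omega) = x_{1i}, X_2(\omega) = x_{2j}] = P(I_{ij}^{(2)}) = p_{1i}p_{2j}$. Adding out $i$ shows
that $P[\omega\colon X_2(\omega) = x_{2j}] = p_{2j}$, as required. Hence $P[X_1 = x_{1i}, X_2 = x_{2j}] =
p_{1i}p_{2j} = P[X_1 = x_{1i}]P[X_2 = x_{2j}]$, and $X_1$ and $X_2$ are independent.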

The construction proceeds inductively. Suppose that $(0,1]$ has been
decomposed into $l_1 \cdots l_n$ intervals


(5.13)    $I^{(n)}_{i_1 \cdots i_n}, \qquad 1 \le i_1 \le l_1, \ \dots, \ 1 \le i_n \le l_n,$


of lengths


(5.14)    $P\bigl(I^{(n)}_{i_1 \cdots i_n}\bigr) = p_{1i_1} \cdots p_{ni_n}.$


† If $b - a = \delta_1 + \cdots + \delta_l$ and $\delta_i \ge 0$, then $I_i = (a + \sum_{j<i}\delta_j, \ a + \sum_{j\le i}\delta_j]$ decomposes $(a,b]$ into
subintervals $I_1,\dots,I_l$ with lengths $\delta_i$. Of course, $I_i$ is empty if $\delta_i = 0$.


Decompose $I^{(n)}_{i_1 \cdots i_n}$ into $l_{n+1}$ subintervals $I^{(n+1)}_{i_1 \cdots i_n 1},\dots,I^{(n+1)}_{i_1 \cdots i_n l_{n+1}}$ of respective
lengths $P(I^{(n)}_{i_1 \cdots i_n})\,p_{n+1,1},\dots,P(I^{(n)}_{i_1 \cdots i_n})\,p_{n+1,l_{n+1}}$. These are the intervals of the
next decomposition. This construction gives a sequence of decompositions
(5.13) of $(0, 1]$ into subintervals; each decomposition satisfies (5.14), and each
refines the preceding one. If $\mu_n$ is given for $1 \le n \le N$, the procedure
terminates after $N$ steps; for an infinite sequence it does not terminate at all.

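For $1 \le i \le l_n$, put $X_n(\omega) = x_{ni}$ if $\omega \in \bigcup_{i_1 \cdots i_{n-1}} I^{(n)}_{i_1 \cdots i_{n-1} i}$. Since each
decomposition (5.13) refines the preceding, $X_k(\omega) = x_{ki_k}$ for $\omega \in I^{(n)}_{i_1 \cdots i_k \cdots i_n}$.
Therefore, each element of (5.13) is contained in the element with the same
label $i_1 \cdots i_n$ in the decomposition


$A_{i_1 \cdots i_n} = [\omega\colon X_1(\omega) = x_{1i_1}, \dots, X_n(\omega) = x_{ni_n}], \qquad 1 \le i_1 \le l_1, \ \dots, \ 1 \le i_n \le l_n.$


The two decompositions thus coincide, and it follows by (5.14) that
$P[X_1 = x_{1i_1}, \dots, X_n = x_{ni_n}] = p_{1i_1} \cdots p_{ni_n}$. Adding out the indices $i_1,\dots,i_{n-1}$
shows that $X_n$ has distribution $\mu_n$ and hence that $X_1,\dots,X_n$ are independent.
But $n$ was arbitrary. ■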
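
The interval-splitting construction in the proof can be sketched in a few lines. What follows is an illustration under assumptions made here, not the text's own notation: each $\mu_n$ is represented as a dictionary from support points to masses, the three example distributions are arbitrary, and `decompose` is a hypothetical helper name. Exact fractions are used so that (5.14) can be tested with equality.

```python
from fractions import Fraction
from itertools import product


def decompose(dists):
    """Nested decompositions of (0, 1] in the manner of (5.13).

    dists[n-1] maps each support point of mu_n to its mass.
    Returns a list of (labels, a, b): the interval (a, b] on which
    (X_1, ..., X_N) takes the values `labels`.
    """
    intervals = [((), Fraction(0), Fraction(1))]
    for dist in dists:
        refined = []
        for labels, a, b in intervals:
            left = a
            for value, p in dist.items():
                right = left + (b - a) * p   # split in the proportions of mu_n
                refined.append((labels + (value,), left, right))
                left = right
        intervals = refined
    return intervals


F = Fraction
dists = [
    {0: F(1, 3), 1: F(2, 3)},
    {-1: F(1, 2), 0: F(1, 4), 2: F(1, 4)},
    {5: F(1, 5), 7: F(4, 5)},
]
intervals = decompose(dists)
# Lebesgue measure of [X_1 = x_1, X_2 = x_2, X_3 = x_3] is the interval length.
joint = {labels: b - a for labels, a, b in intervals}

# (5.14): each length is the product of the prescribed masses.
factorizes = all(
    joint[x] == dists[0][x[0]] * dists[1][x[1]] * dists[2][x[2]]
    for x in product(*dists)
)
# Distribution of X_2, recovered by adding out the other indices.
marginal_x2 = {
    v: sum(p for x, p in joint.items() if x[1] == v) for v in dists[1]
}
print(factorizes, marginal_x2)
```

Because every joint probability factors and every marginal matches the prescribed masses, the three variables defined on $(0,1]$ are independent with the required distributions.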


In the case where the $\mu_n$ are all the same, there is an alternative construction
based on probabilities in sequence space. Let $S$ be the support (finite) common to the
$\mu_n$, and let $p_u$, $u \in S$, be the probabilities common to the $\mu_n$. In sequence space $S^\infty$,
define product measure $P$ on the class $\mathcal{C}_0$ of cylinders by (2.21). By Theorem 2.3, $P$ is
countably additive on $\mathcal{C}_0$, and by Theorem 3.1 it extends to $\mathcal{C} = \sigma(\mathcal{C}_0)$. The coordinate
functions $z_k(\cdot)$ are random variables on the probability space $(S^\infty, \mathcal{C}, P)$; take
these as the $X_k$. Then (2.22) translates into $P[X_1 = u_1, \dots, X_n = u_n] = p_{u_1} \cdots p_{u_n}$,
which is just what Theorem 5.3 requires in this special case.




Probability theorems such as those in the next sections concern independent
sequences $\{X_n\}$ with specified distributions or with distributions having
specified properties, and because of Theorem 5.3 these theorems are true not
merely in the vacuous sense that their hypotheses are never fulfilled. Similar
but more complicated existence theorems will come later. For most purposes
the probability space on which the $X_n$ are defined is largely irrelevant.
Every independent sequence $\{X_n\}$ satisfying $P[X_n = 1] = p$ and $P[X_n = 0] = 1 - p$
is a model for Bernoulli trials, for example, and for an event like
$\bigcup_{n=1}^\infty [\sum_{k=1}^n X_k > \alpha n]$, expressed in terms of the $X_n$ alone, the calculation of
its probability proceeds in the same way whatever the underlying space
$(\Omega, \mathcal{F}, P)$ may be.

It is, of course, an advantage that such results apply not just to some
canonical sequence $\{X_n\}$ (such as the one constructed in the proof above) but
to every sequence with the appropriate distributions. In some applications of
probability within mathematics itself, such as the arithmetic applications of
run theory in the preceding section, the underlying $\Omega$ does play a role.


Expected Value 


A simple random variable in the form (5.2) is assigned expected value or
mean value


(5.15)    $E[X] = E\bigl[\sum_i x_i I_{A_i}\bigr] = \sum_i x_i P(A_i).$


There is the alternative form


(5.16)    $E[X] = \sum_x x\,P[X = x],$


the sum extending over the range of $X$; indeed, (5.15) and (5.16) both
coincide with $\sum_x \sum_{i\colon x_i = x} x_i P(A_i)$. By (5.16) the definition (5.15) is consistent:
different representations (5.2) give the same value to (5.15). From (5.16) it
also follows that $E[X]$ depends only on the distribution of $X$; hence
$E[X] = E[Y]$ if $P[X = Y] = 1$.

If $X$ is a simple random variable on the unit interval and if the $A_i$ in (5.2)
happen to be subintervals, then (5.15) coincides with the Riemann integral as
given by (1.6). More general notions of integral and expected value will be 
studied later. Simple random variables are easy to work with because the 
theory of their expected values is transparent and free of technical complica- 
tions. 

As a special case of (5.15) and (5.16), 


(5.17)    $E[I_A] = P(A).$


As another special case, if a constant a is identified with the random variable 
X(w) =a, then 


(5.18) E[a] =a. 

From (5.2) follows $f(X) = \sum_i f(x_i) I_{A_i}$, and hence


(5.19)    $E[f(X)] = \sum_i f(x_i) P(A_i) = \sum_x f(x) P[X = x],$


the last sum extending over the range of $X$. For example, the $k$th moment
$E[X^k]$ of $X$ is defined by $E[X^k] = \sum_y y\,P[X^k = y]$, where $y$ varies over the
range of $X^k$, but it is usually simpler to compute it by $E[X^k] = \sum_x x^k P[X = x]$,
where $x$ varies over the range of $X$.
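
The agreement of (5.15), (5.16), and the two routes to a moment can be seen on a toy case. The sketch below assumes a hypothetical four-set partition with equal masses and a representation whose coefficients $x_i$ are deliberately not distinct; the function names are illustrative only.

```python
from fractions import Fraction as F

# X = sum_i x_i I_{A_i} for a partition A_1, ..., A_4; the x_i need not be distinct.
x = [1, -1, 1, 2]
pA = [F(1, 4), F(1, 4), F(1, 4), F(1, 4)]


def distribution(values, probs):
    """Collapse a representation (5.2) to the distribution v -> P[X = v]."""
    dist = {}
    for v, p in zip(values, probs):
        dist[v] = dist.get(v, F(0)) + p
    return dist


def expect(dist):
    """(5.16): sum of v * P[X = v] over the range."""
    return sum(v * p for v, p in dist.items())


mean_515 = sum(xi * pi for xi, pi in zip(x, pA))        # (5.15)
mean_516 = expect(distribution(x, pA))                  # (5.16)
# Second moment by the two routes described after (5.19).
m2_via_X = sum(v**2 * p for v, p in distribution(x, pA).items())
m2_via_X2 = expect(distribution([xi**2 for xi in x], pA))
print(mean_515, mean_516, m2_via_X, m2_via_X2)
```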
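If


(5.20)    $X = \sum_i x_i I_{A_i}, \qquad Y = \sum_j y_j I_{B_j}$


are simple random variables, then $\alpha X + \beta Y = \sum_{ij} (\alpha x_i + \beta y_j) I_{A_i \cap B_j}$ has
expected value $\sum_{ij} (\alpha x_i + \beta y_j) P(A_i \cap B_j) = \alpha \sum_i x_i P(A_i) + \beta \sum_j y_j P(B_j)$.
Expected value is therefore linear:


(5.21)    $E[\alpha X + \beta Y] = \alpha E[X] + \beta E[Y].$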


If $X(\omega) \le Y(\omega)$ for all $\omega$, then $x_i \le y_j$ if $A_i \cap B_j$ is nonempty, and hence
$\sum_{ij} x_i P(A_i \cap B_j) \le \sum_{ij} y_j P(A_i \cap B_j)$. Expected value therefore preserves
order:


(5.22)    $E[X] \le E[Y] \quad \text{if } X \le Y.$


(It is enough that $X \le Y$ on a set of probability 1.) Two applications of (5.22)
give $E[-|X|] \le E[X] \le E[|X|]$, so that by linearity,


(5.23)    $|E[X]| \le E[|X|].$

And more generally,

(5.24)    $|E[X - Y]| \le E[|X - Y|].$

The relations (5.17) through (5.24) will be used repeatedly, and so will the
following theorem on expected values and limits. If there is a finite $K$ such
that $|X_n(\omega)| \le K$ for all $\omega$ and $n$, the $X_n$ are uniformly bounded.


Theorem 5.4. If $\{X_n\}$ is uniformly bounded, and if $X = \lim_n X_n$ with
probability 1, then $E[X] = \lim_n E[X_n]$.


PROOF. By Theorem 5.2(ii), convergence with probability 1 implies
convergence in probability: $X_n \to_P X$. And in fact the latter suffices for the
present proof. Increase $K$ so that it bounds $|X|$ (which has finite range) as
well as all the $|X_n|$; then $|X - X_n| \le 2K$. If $A = [\,|X - X_n| \ge \varepsilon]$, then


$|X(\omega) - X_n(\omega)| \le 2K I_A(\omega) + \varepsilon I_{A^c}(\omega) \le 2K I_A(\omega) + \varepsilon$


for all $\omega$. By (5.17), (5.18), (5.21), and (5.22),


$E[|X - X_n|] \le 2K P[\,|X - X_n| \ge \varepsilon] + \varepsilon.$

But since $X_n \to_P X$, the first term on the right goes to 0, and since $\varepsilon$ is
arbitrary, $E[|X - X_n|] \to 0$. Now apply (5.24). ■


Theorems of this kind are of constant use in probability and analysis. For 
the general version, Lebesgue’s dominated convergence theorem, see Sec- 
tion 16. 


Example 5.7. On the unit interval, take $X(\omega)$ identically 0, and take
$X_n(\omega)$ to be $n^2$ if $0 < \omega \le n^{-1}$ and 0 if $n^{-1} < \omega \le 1$. Then $X_n(\omega) \to X(\omega)$ for
every $\omega$, although $E[X_n] = n$ does not converge to $E[X] = 0$. Thus Theorem
5.4 fails without some hypothesis such as that of uniform boundedness. See
also Example 7.7. ■
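
A short numerical companion to Example 5.7 follows. It is a sketch only: the function names and the sample point $\omega = 1/7$ are choices made here. It shows the values $X_n(\omega)$ dropping to 0 for good once $n > 1/\omega$, while the expected values, computed as height times interval length, grow like $n$.

```python
from fractions import Fraction as F


def X(n, w):
    """X_n(w) = n**2 on (0, 1/n] and 0 on (1/n, 1]."""
    return n * n if 0 < w <= F(1, n) else 0


def expected(n):
    """E[X_n] under Lebesgue measure: height n**2 times length 1/n."""
    return n * n * F(1, n)


w = F(1, 7)
values_at_w = [X(n, w) for n in range(1, 13)]   # nonzero only while n <= 7
means = [expected(n) for n in range(1, 13)]     # 1, 2, 3, ...
print(values_at_w, means)
```

The sequence is not uniformly bounded, which is exactly the hypothesis of Theorem 5.4 that fails here.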


An extension of (5.21) is an immediate consequence of Theorem 5.4: 


Corollary. If $X = \sum_n X_n$ on an $\mathcal{F}$-set of probability 1, and if the partial
sums of $\sum_n X_n$ are uniformly bounded, then $E[X] = \sum_n E[X_n]$.


Expected values for independent random variables satisfy the familiar
product law. For $X$ and $Y$ as in (5.20), $XY = \sum_{ij} x_i y_j I_{A_i \cap B_j}$. If the $x_i$ are
distinct and the $y_j$ are distinct, then $A_i = [X = x_i]$ and $B_j = [Y = y_j]$; for
independent $X$ and $Y$, $P(A_i \cap B_j) = P(A_i)P(B_j)$ by (5.9), and so
$E[XY] = \sum_{ij} x_i y_j P(A_i)P(B_j) = E[X]E[Y]$. If $X, Y, Z$ are independent, then $XY$ and
$Z$ are independent by the argument involving (5.10), so that
$E[XYZ] = E[XY]E[Z] = E[X]E[Y]E[Z]$. This obviously extends:


(5.25)    $E[X_1 \cdots X_n] = E[X_1] \cdots E[X_n]$


if $X_1,\dots,X_n$ are independent.
Various concepts from discrete probability carry over to simple random
variables. If $E[X] = m$, the variance of $X$ is


(5.26)    $\mathrm{Var}[X] = E[(X - m)^2] = E[X^2] - m^2;$


the left-hand equality is a definition, the right-hand one a consequence of
expanding the square. Since $\alpha X + \beta$ has mean $\alpha m + \beta$, its variance is
$E[((\alpha X + \beta) - (\alpha m + \beta))^2] = E[\alpha^2 (X - m)^2]$:


(5.27)    $\mathrm{Var}[\alpha X + \beta] = \alpha^2\,\mathrm{Var}[X].$


If $X_1,\dots,X_n$ have means $m_1,\dots,m_n$, then $S = \sum_{i=1}^n X_i$ has mean
$m = \sum_{i=1}^n m_i$, and $E[(S - m)^2] = E[(\sum_{i=1}^n (X_i - m_i))^2] = \sum_{i=1}^n E[(X_i - m_i)^2] +
2\sum_{1 \le i < j \le n} E[(X_i - m_i)(X_j - m_j)]$. If the $X_i$ are independent, then so are the
$X_i - m_i$, and by (5.25) the last sum vanishes. This gives the familiar formula
for the variance of a sum of independent random variables:


(5.28)    $\mathrm{Var}\bigl[\sum_{i=1}^n X_i\bigr] = \sum_{i=1}^n \mathrm{Var}[X_i].$

Suppose that $X$ is nonnegative; order its range: $0 \le x_1 < x_2 < \cdots < x_k$.
Then


$E[X] = \sum_{i=1}^k x_i P[X = x_i]$

$\qquad = \sum_{i=1}^{k-1} x_i \bigl(P[X \ge x_i] - P[X \ge x_{i+1}]\bigr) + x_k P[X \ge x_k]$

$\qquad = x_1 P[X \ge x_1] + \sum_{i=2}^k (x_i - x_{i-1}) P[X \ge x_i].$


Since $P[X > x] = P[X \ge x_1]$ for $0 \le x < x_1$ and $P[X > x] = P[X \ge x_i]$ for
$x_{i-1} \le x < x_i$, it is possible to write the final sum as the Riemann integral of a step
function:


(5.29)    $E[X] = \int_0^\infty P[X > x]\,dx.$


This holds if $X$ is nonnegative. Since $P[X > x] = 0$ for $x > x_k$, the range of
integration is really finite.
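
Formula (5.29) can be tested exactly on a small distribution. In the sketch below the three-point distribution is an arbitrary choice made here, and `tail_integral` is a hypothetical helper that integrates the step function $x \mapsto P[X > x]$ piece by piece, using the fact that it equals $P[X \ge x_i]$ on $[x_{i-1}, x_i)$.

```python
from fractions import Fraction as F

# Distribution of a nonnegative simple random variable: x -> P[X = x].
dist = {F(0): F(1, 5), F(1, 2): F(3, 10), F(3): F(1, 2)}


def mean(dist):
    """(5.16)."""
    return sum(x * p for x, p in dist.items())


def tail_integral(dist):
    """Integral over [0, infinity) of the step function x -> P[X > x]."""
    total, prev = F(0), F(0)
    for xi in sorted(dist):
        upper = sum(p for x, p in dist.items() if x >= xi)  # P[X >= x_i]
        total += (xi - prev) * upper   # P[X > x] = P[X >= x_i] on [prev, x_i)
        prev = xi
    return total


print(mean(dist), tail_integral(dist))
```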

There is for (5.29) a simple geometric argument involving the "area over
the curve." If $p_i = P[X = x_i]$, the area of the shaded region in the figure is
the sum $p_1x_1 + \cdots + p_kx_k = E[X]$ of the areas of the horizontal strips; it is
also the integral of the height $P[X > x]$ of the region.


Inequalities


There are for expected values several standard inequalities that will be
needed. If $X$ is nonnegative, then for positive $\alpha$ (sum over the range of $X$)
$E[X] = \sum_x x P[X = x] \ge \sum_{x\colon x \ge \alpha} x P[X = x] \ge \alpha \sum_{x\colon x \ge \alpha} P[X = x]$. Therefore,


(5.30)    $P[X \ge \alpha] \le \frac{1}{\alpha} E[X]$


if $X$ is nonnegative and $\alpha$ positive. A special case of this is (1.20). Applied to
$|X|^k$, (5.30) gives Markov's inequality,


(5.31)    $P[\,|X| \ge \alpha] \le \frac{1}{\alpha^k} E[|X|^k],$


valid for positive $\alpha$. If $k = 2$ and $m = E[X]$ is subtracted from $X$, this
becomes the Chebyshev (or Chebyshev-Bienaymé) inequality:


(5.32)    $P[\,|X - m| \ge \alpha] \le \frac{1}{\alpha^2} \mathrm{Var}[X].$
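
To see (5.32) at work, the following sketch computes both sides exactly. The "generic" distribution is an arbitrary choice made here; the three-point distribution $m - \alpha, m, m + \alpha$ with masses $p, 1 - 2p, p$ is the standard case in which Chebyshev's inequality holds with equality. The function names are illustrative.

```python
from fractions import Fraction as F


def mean(dist):
    return sum(x * p for x, p in dist.items())


def variance(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())


def chebyshev_sides(dist, alpha):
    """Return (P[|X - m| >= alpha], Var[X] / alpha**2), the two sides of (5.32)."""
    m = mean(dist)
    lhs = sum((p for x, p in dist.items() if abs(x - m) >= alpha), F(0))
    return lhs, variance(dist) / alpha ** 2


# A generic distribution: the bound holds, usually with room to spare.
generic = {F(-2): F(1, 10), F(0): F(6, 10), F(1): F(2, 10), F(5): F(1, 10)}
# Three-point distribution m - a, m, m + a with masses p, 1 - 2p, p.
m, a, p = F(1), F(2), F(1, 8)
three_point = {m - a: p, m: 1 - 2 * p, m + a: p}

print(chebyshev_sides(generic, F(3)))      # strict inequality
print(chebyshev_sides(three_point, a))     # equality
```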


A function $\varphi$ on an interval is convex [A32] if $\varphi(px + (1-p)y) \le p\varphi(x) +
(1-p)\varphi(y)$ for $0 \le p \le 1$ and $x$ and $y$ in the interval. A sufficient condition
for this is that $\varphi$ have a nonnegative second derivative. It follows by
induction that $\varphi(\sum_{i=1}^l p_i x_i) \le \sum_{i=1}^l p_i \varphi(x_i)$ if the $p_i$ are nonnegative and add
to 1 and the $x_i$ are in the domain of $\varphi$. If $X$ assumes the value $x_i$ with
probability $p_i$, this becomes Jensen's inequality,


(5.33)    $\varphi(E[X]) \le E[\varphi(X)],$


valid if $\varphi$ is convex on an interval containing the range of $X$.
Suppose that


(5.34)    $\frac{1}{p} + \frac{1}{q} = 1, \qquad p > 1, \quad q > 1.$

Hölder's inequality is

(5.35)    $E[|XY|] \le E^{1/p}[|X|^p] \cdot E^{1/q}[|Y|^q].$


If, say, the first factor on the right vanishes, then $X = 0$ with probability 1,
hence $XY = 0$ with probability 1, and hence the left side vanishes also.
Assume then that the right side of (5.35) is positive. If $a$ and $b$ are positive,
there exist $s$ and $t$ such that $a = e^{p^{-1}s}$ and $b = e^{q^{-1}t}$. Since $e^x$ is convex,
$e^{p^{-1}s + q^{-1}t} \le p^{-1}e^s + q^{-1}e^t$, or


$ab \le \frac{a^p}{p} + \frac{b^q}{q}.$


This obviously holds for nonnegative as well as for positive $a$ and $b$. Let $u$
and $v$ be the two factors on the right in (5.35). For each $\omega$,


$\left|\frac{X(\omega)Y(\omega)}{uv}\right| \le \frac{1}{p}\left|\frac{X(\omega)}{u}\right|^p + \frac{1}{q}\left|\frac{Y(\omega)}{v}\right|^q.$


Taking expected values and applying (5.34) leads to (5.35).
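If $p = q = 2$, Hölder's inequality becomes Schwarz's inequality:


(5.36)    $E[|XY|] \le E^{1/2}[X^2] \cdot E^{1/2}[Y^2].$


Suppose that $0 < \alpha < \beta$. In (5.35) take $p = \beta/\alpha$, $q = \beta/(\beta - \alpha)$, and
$Y(\omega) = 1$, and replace $X$ by $|X|^\alpha$. The result is Lyapounov's inequality,


(5.37)    $E^{1/\alpha}[|X|^\alpha] \le E^{1/\beta}[|X|^\beta], \qquad 0 < \alpha \le \beta.$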
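
Hölder's, Schwarz's, and Lyapounov's inequalities are easy to spot-check numerically. The sketch below uses floating point on a four-point sample space; the probabilities and the values of $X$ and $Y$ are arbitrary choices made here, and `norm` is an illustrative name for $E^{1/r}[|V|^r]$.

```python
# Finite sample space with point masses; X and Y are listed by their values.
probs = [0.1, 0.2, 0.3, 0.4]
X = [-2.0, 0.5, 1.0, 3.0]
Y = [4.0, -1.0, 0.0, 0.5]


def E(values):
    """Expected value of a simple random variable given by its values."""
    return sum(v * p for v, p in zip(values, probs))


def norm(values, r):
    """E^{1/r}[|V|^r]."""
    return E([abs(v) ** r for v in values]) ** (1.0 / r)


def holder_sides(p):
    """Return the left and right sides of (5.35) for exponent p > 1."""
    q = p / (p - 1.0)                      # conjugate exponent from (5.34)
    lhs = E([abs(x * y) for x, y in zip(X, Y)])
    return lhs, norm(X, p) * norm(Y, q)


for p in (1.5, 2.0, 3.0):                  # p = 2 is Schwarz's inequality
    print(p, holder_sides(p))
# Lyapounov: r -> E^{1/r}[|X|^r] is nondecreasing in r.
print([norm(X, r) for r in (1, 2, 4)])
```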


PROBLEMS


5.1. (a) Show that $X$ is measurable with respect to the $\sigma$-field $\mathcal{G}$ if and only if
$\sigma(X) \subset \mathcal{G}$. Show that $X$ is measurable $\sigma(Y)$ if and only if $\sigma(X) \subset \sigma(Y)$.

(b) Show that, if $\mathcal{G} = \{\emptyset, \Omega\}$, then $X$ is measurable $\mathcal{G}$ if and only if $X$ is
constant.

(c) Suppose that $P(A)$ is 0 or 1 for every $A$ in $\mathcal{G}$. This holds, for example, if $\mathcal{G}$
is the tail field of an independent sequence (Theorem 4.5), or if $\mathcal{G}$ consists of
the countable and cocountable sets on the unit interval with Lebesgue measure.
Show that if $X$ is measurable $\mathcal{G}$, then $P[X = c] = 1$ for some constant $c$.


5.2. 2.19↑ Show that the unit interval can be replaced by any nonatomic probability
measure space in the proof of Theorem 5.3.


5.3. Show that $m = E[X]$ minimizes $E[(X - m)^2]$.


5.4. Suppose that $X$ assumes the values $m - \alpha, m, m + \alpha$ with probabilities $p, 1 - 2p, p$,
and show that there is equality in (5.32). Thus Chebyshev's inequality
cannot be improved without special assumptions on $X$.


5.5. Suppose that $X$ has mean $m$ and variance $\sigma^2$.
(a) Prove Cantelli's inequality


$P[X - m \ge \alpha] \le \frac{\sigma^2}{\sigma^2 + \alpha^2}, \qquad \alpha \ge 0.$


(b) Show that $P[\,|X - m| \ge \alpha] \le 2\sigma^2/(\sigma^2 + \alpha^2)$. When is this better than
Chebyshev's inequality?

(c) By considering a random variable assuming two values, show that Cantelli's
inequality is sharp.


5.6. The polynomial $E[(t|X| + |Y|)^2]$ in $t$ has at most one real zero. Deduce
Schwarz's inequality once more.


5.7. (a) Write (5.37) in the form $E^{\beta/\alpha}[|X|^\alpha] \le E[(|X|^\alpha)^{\beta/\alpha}]$ and deduce it directly
from Jensen's inequality.

(b) Prove that $E[1/X^p] \ge 1/E^p[X]$ for $p > 0$ and $X$ a positive random variable.


5.8. (a) Let $f$ be a convex real function on a convex set $C$ in the plane. Suppose
that $(X(\omega), Y(\omega)) \in C$ for all $\omega$ and prove a two-dimensional Jensen's inequality:


(5.38)    $f(E[X], E[Y]) \le E[f(X, Y)].$


(b) Show that $f$ is convex if it has continuous second derivatives that satisfy


(5.39)    $f_{11} \ge 0, \qquad f_{22} \ge 0, \qquad f_{11}f_{22} \ge f_{12}^2.$


5.9. ↑ Hölder's inequality is equivalent to $E[X^{1/p}Y^{1/q}] \le E^{1/p}[X] \cdot E^{1/q}[Y]$
($p^{-1} + q^{-1} = 1$), where $X$ and $Y$ are nonnegative random variables. Derive this
from (5.38).


5.10. ↑ Minkowski's inequality is


(5.40)    $E^{1/p}[|X + Y|^p] \le E^{1/p}[|X|^p] + E^{1/p}[|Y|^p],$


valid for $p \ge 1$. It is enough to prove that $E[(X^{1/p} + Y^{1/p})^p] \le (E^{1/p}[X] +
E^{1/p}[Y])^p$ for nonnegative $X$ and $Y$. Use (5.38).

5.11. For events $A_1, A_2, \dots$, not necessarily independent, let $N_n = \sum_{k=1}^n I_{A_k}$ be the
number to occur among the first $n$. Let


(5.41)    $\alpha_n = \frac{1}{n} \sum_{k=1}^n P(A_k), \qquad \beta_n = \frac{2}{n(n-1)} \sum_{1 \le j < k \le n} P(A_j \cap A_k).$


Show that


(5.42)    $E[n^{-1}N_n] = \alpha_n, \qquad \mathrm{Var}[n^{-1}N_n] = \beta_n - \alpha_n^2 + \frac{\alpha_n - \beta_n}{n}.$


Thus $\mathrm{Var}[n^{-1}N_n] \to 0$ if and only if $\beta_n - \alpha_n^2 \to 0$, which holds if the $A_n$ are
independent and $P(A_n) = p$ (Bernoulli trials), because then $\alpha_n = p$ and
$\beta_n = p^2 = \alpha_n^2$.


5.12. Show that, if $X$ has nonnegative integers as values, then $E[X] = \sum_{n=1}^\infty P[X \ge n]$.


5.13. Let $I_i = I_{A_i}$ be the indicators of $n$ events having union $A$. Let $S_k = \sum I_{i_1} \cdots I_{i_k}$,
where the summation extends over all $k$-tuples satisfying $1 \le i_1 < \cdots < i_k \le n$.
Then $s_k = E[S_k]$ are the terms in the inclusion-exclusion formula
$P(A) = s_1 - s_2 + \cdots \pm s_n$. Deduce the inclusion-exclusion formula from
$I_A = S_1 - S_2 + \cdots \pm S_n$. Prove the latter formula by expanding the product $\prod_{i=1}^n (1 - I_i)$.


5.14. Let $f_n(x)$ be $n^2x$ or $2n - n^2x$ or 0 according as $0 \le x \le n^{-1}$ or
$n^{-1} \le x \le 2n^{-1}$ or $2n^{-1} \le x \le 1$. This gives a standard example of a sequence of
continuous functions that converges to 0 but not uniformly. Note that $\int_0^1 f_n(x)\,dx$
does not converge to 0; relate to Example 5.7.


5.15. By Theorem 5.3, for any prescribed sequence of probabilities $p_n$, there exists
(on some space) an independent sequence of events $A_n$ satisfying $P(A_n) = p_n$.
Show that if $p_n \to 0$ but $\sum p_n = \infty$, this gives a counterexample (like Example 5.4)
to the converse of Theorem 5.2(ii).


5.16. ↑ Suppose that $0 \le p_n \le 1$ and put $\alpha_n = \min\{p_n, 1 - p_n\}$. Show that, if $\sum \alpha_n$
converges, then on some discrete probability space there exist independent
events $A_n$ satisfying $P(A_n) = p_n$. Compare Problem 1.1(b).


5.17. (a) Suppose that $X_n \to_P X$ and that $f$ is continuous. Show that $f(X_n) \to_P f(X)$.

(b) Show that $E[|X - X_n|] \to 0$ implies $X_n \to_P X$. Show that the converse is
false.


5.18. 2.20↑ The proof given for Theorem 5.3 for the special case where the $\mu_n$ are
all the same can be extended to cover the general case: use Problem 2.20.


5.19. 2.18↑ For integers $m$ and primes $p$, let $\alpha_p(m)$ be the exact power of $p$ in the
prime factorization of $m$: $m = \prod_p p^{\alpha_p(m)}$. Let $\delta_p(m)$ be 1 or 0 as $p$ divides $m$ or
not. Under each $P_n$ (see (2.34)) the $\alpha_p$ and $\delta_p$ are random variables. Show that
for distinct primes $p_1,\dots,p_u$,


(5.43)    $P_n[\alpha_{p_i} \ge k_i, \ i \le u] = \frac{1}{n} \left\lfloor \frac{n}{p_1^{k_1} \cdots p_u^{k_u}} \right\rfloor \to \frac{1}{p_1^{k_1} \cdots p_u^{k_u}}$

and

(5.44)    $P_n[\alpha_{p_i} = k_i, \ i \le u] \to \prod_{i=1}^u \left( \frac{1}{p_i^{k_i}} - \frac{1}{p_i^{k_i+1}} \right).$

Similarly,

(5.45)    $P_n[\delta_{p_i} = 1, \ i \le u] = \frac{1}{n} \left\lfloor \frac{n}{p_1 \cdots p_u} \right\rfloor \to \frac{1}{p_1 \cdots p_u}.$


According to (5.44), the $\alpha_p$ are for large $n$ approximately independent under
$P_n$, and according to (5.45), the same is true of the $\delta_p$.

For a function $f$ of positive integers, let

(5.46)    $E_n[f] = \frac{1}{n} \sum_{m=1}^n f(m)$


be its expected value under the probability measure $P_n$. Show that


(5.47)    $E_n[\alpha_p] = \sum_{k=1}^\infty \frac{1}{n} \left\lfloor \frac{n}{p^k} \right\rfloor \to \frac{1}{p - 1};$


this says roughly that $(p - 1)^{-1}$ is the average power of $p$ in the factorization of
large integers.


5.20. ↑ (a) From Stirling's formula, deduce


(5.48)    $E_n[\log] = \log n + O(1).$


From this, the inequality $E_n[\alpha_p] \le 2/p$, and the relation $\log m = \sum_p \alpha_p(m) \log p$,
conclude that $\sum_p p^{-1} \log p$ diverges and that there are infinitely many primes.


(b) Let $\log^* m = \sum_p \delta_p(m) \log p$. Show that

(5.49)    $E_n[\log^*] = \sum_p \frac{1}{n} \left\lfloor \frac{n}{p} \right\rfloor \log p = \log n + O(1).$


(c) Show that $\lfloor 2n/p \rfloor - 2\lfloor n/p \rfloor$ is always nonnegative and equals 1 in the
range $n < p \le 2n$. Deduce $E_{2n}[\log^*] - E_n[\log^*] = O(1)$ and conclude that


(5.50)    $\sum_{p \le x} \log p = O(x).$


Use this to estimate the error removing the integral-part brackets introduces
into (5.49), and show that


(5.51)    $\sum_{p \le x} p^{-1} \log p = \log x + O(1).$


(d) Restrict the range of summation in (5.51) to $\theta x < p \le x$ for an appropriate
$\theta$, and conclude that


(5.52)    $\sum_{p \le x} \log p \asymp x,$


in the sense that the ratio of the two sides is bounded away from 0 and $\infty$.

(e) Use (5.52) and truncation arguments to prove for the number $\pi(x)$ of
primes not exceeding $x$ that


(5.53)    $\pi(x) \asymp \frac{x}{\log x}.$

(By the prime number theorem the ratio of the two sides in fact goes to 1.)
Conclude that the $r$th prime $p_r$ satisfies $p_r \asymp r \log r$ and that


(5.54)    $\sum_p \frac{1}{p} = \infty.$


The Strong Law


Let $X_1, X_2, \dots$ be a sequence of simple random variables on some probability
space $(\Omega, \mathcal{F}, P)$. They are identically distributed if their distributions (in
the sense of (5.12)) are all the same. Define $S_n = X_1 + \cdots + X_n$. The strong
law of large numbers:


Theorem 6.1. If the $X_n$ are independent and identically distributed and
$E[X_n] = m$, then


(6.1)    $P\bigl[\lim_n n^{-1}S_n = m\bigr] = 1.$


PROOF. The conclusion is that $n^{-1}S_n - m = n^{-1}\sum_{i=1}^n (X_i - m) \to 0$ with
probability 1. Replacing $X_i$ by $X_i - m$ shows that there is no loss of
generality in assuming that $m = 0$. The set in question does lie in $\mathcal{F}$ (see
(5.5)), and by Theorem 5.2(i), it is enough to show that $P[\,|n^{-1}S_n| \ge \varepsilon \ \text{i.o.}] = 0$
for each $\varepsilon$.

Let $E[X_i^2] = \sigma^2$ and $E[X_i^4] = \xi^4$. The proof is like that for Theorem 1.2.
First (see (1.26)), $E[S_n^4] = \sum E[X_\alpha X_\beta X_\gamma X_\delta]$, the four indices ranging
independently from 1 to $n$. Since $E[X_i] = 0$, it follows by the product rule (5.25)
for independent random variables that the summand vanishes if there is one
index different from the three others. This leaves terms of the form $E[X_i^4] = \xi^4$,
of which there are $n$, and terms of the form $E[X_i^2X_j^2] = E[X_i^2]E[X_j^2] = \sigma^4$
for $i \ne j$, of which there are $3n(n-1)$. Hence


(6.2)    $E[S_n^4] = n\xi^4 + 3n(n-1)\sigma^4 \le Kn^2,$


where $K$ does not depend on $n$.
By Markov's inequality (5.31) for $k = 4$, $P[\,|S_n| \ge n\varepsilon] \le Kn^{-2}\varepsilon^{-4}$, and so by
the first Borel-Cantelli lemma, $P[\,|n^{-1}S_n| \ge \varepsilon \ \text{i.o.}] = 0$, as required. ■
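
The fourth-moment count behind (6.2) can be confirmed by exhaustive enumeration for small $n$. The sketch below assumes a particular mean-zero, non-symmetric two-point distribution chosen here for illustration; the function names are likewise hypothetical. It compares $E[S_n^4]$, summed over all outcomes of $n$ independent copies, with $n\xi^4 + 3n(n-1)\sigma^4$.

```python
from fractions import Fraction as F
from itertools import product

dist = {-1: F(2, 3), 2: F(1, 3)}  # mean zero, not symmetric


def moment(k):
    """E[X^k] for the common distribution."""
    return sum(F(x) ** k * p for x, p in dist.items())


def fourth_moment_of_sum(n):
    """E[S_n^4], summing over all outcomes of n independent copies."""
    total = F(0)
    for outcome in product(dist, repeat=n):
        weight = F(1)
        for x in outcome:
            weight *= dist[x]          # independence: probabilities multiply
        total += F(sum(outcome)) ** 4 * weight
    return total


def formula_62(n):
    """Right side of the equality in (6.2)."""
    sigma2, xi4 = moment(2), moment(4)
    return n * xi4 + 3 * n * (n - 1) * sigma2 ** 2


K = moment(4) + 3 * moment(2) ** 2     # one admissible constant in (6.2)
for n in range(1, 6):
    print(n, fourth_moment_of_sum(n), formula_62(n), K * n * n)
```

Note that $E[X^3] \ne 0$ for this distribution; the cross terms still vanish because each contains a lone factor with mean 0.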


Example 6.1. The classical example is the strong law of large numbers for
Bernoulli trials. Here $P[X_n = 1] = p$, $P[X_n = 0] = 1 - p$, $m = p$; $S_n$ represents
the number of successes in $n$ trials, and $n^{-1}S_n \to p$ with probability 1. The
idea of probability as frequency depends on the long-range stability of the
success ratio $S_n/n$. ■


Example 6.2. Theorem 1.2 is the case of Example 6.1 in which $(\Omega, \mathcal{F}, P)$
is the unit interval and the $X_n(\omega)$ are the digits $d_n(\omega)$ of the dyadic
expansion of $\omega$. Here $p = \tfrac12$. The set (1.21) of normal numbers in the unit
interval has by (6.1) Lebesgue measure 1; its complement has measure 0 (and
so in the terminology of Section 1 is negligible). ■


The Weak Law 


Since convergence with probability 1 implies convergence in probability
(Theorem 5.2(ii)), it follows under the hypotheses of Theorem 6.1 that
$n^{-1}S_n \to_P m$. But this is of course an immediate consequence of Chebyshev's
inequality (5.32) and the rule (5.28) for adding variances:


$P[\,|n^{-1}S_n - m| \ge \varepsilon] \le \frac{\mathrm{Var}[S_n]}{n^2\varepsilon^2} = \frac{n\,\mathrm{Var}[X_1]}{n^2\varepsilon^2} \to 0.$

This is the weak law of large numbers.
Chebyshev’s inequality leads to a weak law in other interesting cases as 
well: 


Example 6.3. Let $\Omega_n$ consist of the $n!$ permutations of $1, 2, \dots, n$, all
equally probable, and let $X_{nk}(\omega)$ be 1 or 0 according as the $k$th element in
the cyclic representation of $\omega \in \Omega_n$ completes a cycle or not. This is Example
5.6, although there the dependence on $n$ was suppressed in the notation. The
$X_{n1},\dots,X_{nn}$ are independent, and $S_n = X_{n1} + \cdots + X_{nn}$ is the number of
cycles. The mean $m_{nk}$ of $X_{nk}$ is the probability that it equals 1, namely
$(n - k + 1)^{-1}$, and its variance is $\sigma_{nk}^2 = m_{nk}(1 - m_{nk})$.

If $L_n = \sum_{k=1}^n k^{-1}$, then $S_n$ has mean $\sum_{k=1}^n m_{nk} = L_n$ and variance
$\sum_{k=1}^n m_{nk}(1 - m_{nk}) < L_n$. By Chebyshev's inequality,


$P\left[\left|\frac{S_n}{L_n} - 1\right| \ge \varepsilon\right] < \frac{L_n}{\varepsilon^2 L_n^2} = \frac{1}{\varepsilon^2 L_n} \to 0.$


Of the $n!$ permutations on $n$ letters, a proportion exceeding $1 - \varepsilon^{-2}L_n^{-1}$ thus
have their cycle number in the range $(1 \pm \varepsilon)L_n$. Since $L_n = \log n + O(1)$,
most permutations on $n$ letters have about $\log n$ cycles. For a refinement, see
Example 27.3.

Since $\Omega_n$ changes with $n$, it is the nature of the case that there cannot be
a strong law corresponding to this result. ■
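
For a small $n$ the mean, the variance, and the Chebyshev bound of Example 6.3 can be compared with an exhaustive count. The sketch below is illustrative: $n = 6$ and $\varepsilon = 1$ are choices made here, and `cycle_count` and `harmonic` are hypothetical helper names.

```python
from fractions import Fraction as F
from itertools import permutations


def cycle_count(perm):
    """Number of cycles of the permutation i -> perm[i-1] of 1..n."""
    seen, cycles = set(), 0
    for start in range(1, len(perm) + 1):
        if start not in seen:
            cycles += 1
            cur = start
            while cur not in seen:     # walk the cycle containing `start`
                seen.add(cur)
                cur = perm[cur - 1]
    return cycles


def harmonic(n):
    """L_n = 1 + 1/2 + ... + 1/n."""
    return sum(F(1, k) for k in range(1, n + 1))


n = 6
counts = [cycle_count(p) for p in permutations(range(1, n + 1))]
mean = F(sum(counts), len(counts))
var = F(sum(c * c for c in counts), len(counts)) - mean ** 2
L = harmonic(n)
predicted_var = sum(F(1, k) * (1 - F(1, k)) for k in range(1, n + 1))

eps = F(1)
# Proportion of permutations with cycle number strictly within (1 +/- eps) L_n.
within = F(sum(1 for c in counts if abs(c - L) < eps * L), len(counts))
chebyshev_floor = 1 - 1 / (eps ** 2 * L)
print(mean, L, var, predicted_var, within, chebyshev_floor)
```

For $n$ this small the Chebyshev floor is far from the true proportion; the bound only becomes informative as $L_n$ grows, which it does at the slow rate $\log n$.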


Bernstein’s Theorem 


Some theorems that can be stated without reference to probability nonethe- 
less have simple probabilistic proofs, as the last example shows. Bernstein’s 
approach to the Weierstrass approximation theorem is another example. 
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Let f be a function on [0,1]. The Bernstein polynomial of degree n 
associated with f is 


(6.3) B,(x) = DAF) (z) sta _ x)" 


Theorem 6.2. If f is continuous, B,(x) converges to f(x) uniformly on 
[0, 1]. 


According to the Weierstrass approximation theorem, f can be uniformly 
approximated by polynomials; Bernstein’s result goes further and specifies an 
approximating sequence. 


Proof. Let $M = \sup_x |f(x)|$, and let $\delta(\epsilon) = \sup[|f(x) - f(y)|: |x - y| \le \epsilon]$
be the modulus of continuity of $f$. It will be shown that

(6.4)    $\sup_x |f(x) - B_n(x)| \le \delta(\epsilon) + \frac{2M}{n\epsilon^2}.$

By the uniform continuity of $f$, $\lim_{\epsilon \to 0} \delta(\epsilon) = 0$, and so this inequality (for
$\epsilon = n^{-1/3}$, say) will give the theorem.

Fix $n \ge 1$ and $x \in [0,1]$ for the moment. Let $X_1, \ldots, X_n$ be independent
random variables (on some probability space) such that $P[X_i = 1] = x$ and
$P[X_i = 0] = 1 - x$; put $S = X_1 + \cdots + X_n$. Since $P[S = k] = \binom{n}{k} x^k (1 - x)^{n-k}$,
the formula (5.19) for calculating expected values of functions of random
variables gives $E[f(S/n)] = B_n(x)$. By the law of large numbers, there should
be high probability that $S/n$ is near $x$ and hence ($f$ being continuous) that
$f(S/n)$ is near $f(x)$; $E[f(S/n)]$ should therefore be near $f(x)$. This is the
probabilistic idea behind the proof and, indeed, behind the definition (6.3)
itself.

Bound $|f(n^{-1}S) - f(x)|$ by $\delta(\epsilon)$ on the set $[|n^{-1}S - x| < \epsilon]$ and by $2M$ on
the complementary set, and use (5.22) as in the proof of Theorem 5.4. Since
$E[S] = nx$, Chebyshev’s inequality gives

    $|B_n(x) - f(x)| \le E[|f(n^{-1}S) - f(x)|]$
        $\le \delta(\epsilon) P[|n^{-1}S - x| < \epsilon] + 2M P[|n^{-1}S - x| \ge \epsilon]$
        $\le \delta(\epsilon) + 2M\,\mathrm{Var}[S]/n^2\epsilon^2;$

since $\mathrm{Var}[S] = nx(1 - x) \le n$, (6.4) follows.
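
As a numerical illustration of Theorem 6.2 (not part of the proof), the following Python sketch evaluates the Bernstein polynomial directly from its definition as a binomial expectation and estimates the uniform error on a grid. The function names and the grid-based stand-in for the supremum are assumptions made for the illustration.

```python
from math import comb


def bernstein(f, n, x):
    """B_n(x) = sum over k of f(k/n) C(n,k) x^k (1-x)^(n-k), i.e. E[f(S/n)] for S binomial(n, x)."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1))


def sup_error(f, n, grid=201):
    """Largest value of |f - B_n| over an evenly spaced grid on [0, 1] (a proxy for the sup norm)."""
    xs = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in xs)


if __name__ == "__main__":
    f = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2
    for n in (4, 16, 64, 256):
        print(n, sup_error(f, n))
```

For $f(x) = x^2$ the expectation can be computed by hand: $E[(S/n)^2] = x^2 + x(1 - x)/n$, so the error is exactly $x(1 - x)/n$, which gives a convenient check on the code.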


A Refinement of the Second Borel—Cantelli Lemma 


For a sequence $A_1, A_2, \ldots$ of events, consider the number $N_n = I_{A_1} + \cdots + I_{A_n}$
of occurrences among $A_1, \ldots, A_n$. Since $[A_n \text{ i.o.}] = [\omega: \sup_n N_n(\omega) = \infty]$,
$P[A_n \text{ i.o.}]$ can be studied by means of the random variables $N_n$.

Suppose that the $A_n$ are independent. Put $p_k = P(A_k)$ and $m_n = p_1 + \cdots + p_n$.
From $E[I_{A_k}] = p_k$ and $\mathrm{Var}[I_{A_k}] = p_k(1 - p_k) \le p_k$ follow $E[N_n] = m_n$
and $\mathrm{Var}[N_n] = \sum_{k=1}^{n} \mathrm{Var}[I_{A_k}] \le m_n$. If $m_n > x$, then

(6.5)    $P[N_n \le x] \le P[|N_n - m_n| \ge m_n - x] \le \frac{\mathrm{Var}[N_n]}{(m_n - x)^2} \le \frac{m_n}{(m_n - x)^2}.$

If $\sum p_n = \infty$, so that $m_n \to \infty$, it follows that $\lim_n P[N_n \le x] = 0$ for each $x$.
Since

(6.6)    $P[\sup_k N_k \le x] \le P[N_n \le x],$

$P[\sup_k N_k \le x] = 0$ and hence (take the union over $x = 1, 2, \ldots$)
$P[\sup_k N_k < \infty] = 0$. Thus $P[A_n \text{ i.o.}] = P[\sup_n N_n = \infty] = 1$ if the $A_n$ are
independent and $\sum p_n = \infty$, which proves the second Borel-Cantelli lemma once again.

Independence was used in this argument only to estimate $\mathrm{Var}[N_n]$. Even
without independence, $E[N_n] = m_n$ and the first two inequalities in (6.5)
hold.


Theorem 6.3. If $\sum P(A_n)$ diverges and

(6.7)    $\liminf_n \frac{\sum_{j,k \le n} P(A_j \cap A_k)}{\left(\sum_{k \le n} P(A_k)\right)^2} \le 1,$

then $P[A_n \text{ i.o.}] = 1$.


As the proof will show, the ratio in (6.7) is at least 1; if (6.7) holds, the
inequality must therefore be an equality.


Proof. Let $\theta_n$ denote the ratio in (6.7). In the notation above,

    $\mathrm{Var}[N_n] = E[N_n^2] - m_n^2 = \sum_{j,k \le n} E[I_{A_j} I_{A_k}] - m_n^2$
        $= \sum_{j,k \le n} P(A_j \cap A_k) - m_n^2 = (\theta_n - 1) m_n^2$

(and $\theta_n - 1 \ge 0$). Hence (see (6.5)) $P[N_n \le x] \le (\theta_n - 1) m_n^2/(m_n - x)^2$ for
$x < m_n$. Since $m_n^2/(m_n - x)^2 \to 1$, (6.7) implies that $\liminf_n P[N_n \le x] = 0$. It
still follows by (6.6) that $P[\sup_k N_k \le x] = 0$, and the rest of the argument is
as before.

Example 6.4. If, as in the second Borel-Cantelli lemma, the $A_n$ are
independent (or even if they are merely independent in pairs), the ratio in
(6.7) is $1 + \sum_{k \le n} (p_k - p_k^2)/m_n^2$, so that $\sum P(A_n) = \infty$ implies (6.7).
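
The closed form in Example 6.4 is easy to confirm numerically. The Python sketch below is an illustration under stated assumptions: the events are taken to be pairwise independent, so the intersection probabilities are products, and the choice $p_k = 1/k$ (a divergent series) is purely for demonstration. The function names are hypothetical.

```python
from fractions import Fraction


def theta_pairwise_independent(p):
    """Ratio in (6.7) when P(A_j and A_k) = p_j * p_k for j != k and P(A_k) = p_k."""
    m = sum(p)
    off_diagonal = sum(pj * pk for j, pj in enumerate(p) for k, pk in enumerate(p) if j != k)
    return (off_diagonal + m) / (m * m)


def theta_closed_form(p):
    """The expression 1 + sum(p_k - p_k^2) / m_n^2 from Example 6.4."""
    m = sum(p)
    return 1 + sum(pk - pk * pk for pk in p) / (m * m)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        probs = [1.0 / k for k in range(1, n + 1)]
        # theta_n - 1 is at most 1/m_n, so it tends to 0 as the partial sums diverge.
        print(n, theta_closed_form(probs) - 1, 1 / sum(probs))
```

Since $\sum_{k \le n}(p_k - p_k^2) \le m_n$, the excess $\theta_n - 1$ is at most $1/m_n$, which is why divergence of $\sum P(A_n)$ forces (6.7).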


Example 6.5. Return once again to the run lengths $l_n(\omega)$ of Section 4. It
was shown in Example 4.21 that $\{r_n\}$ is an outer boundary ($P[l_n \ge r_n \text{ i.o.}] = 0$)
if $\sum 2^{-r_n} < \infty$. It was also shown that $\{r_n\}$ is an inner boundary
($P[l_n \ge r_n \text{ i.o.}] = 1$) if $r_n$ is nondecreasing and $\sum 2^{-r_n} r_n^{-1} = \infty$, but Theorem 6.3
can be used to prove this under the sole assumption that $\sum 2^{-r_n} = \infty$.

As usual, the $r_n$ can be taken to be positive integers. Let
$A_n = [l_n \ge r_n] = [d_n = \cdots = d_{n+r_n-1} = 0]$. If $j + r_j \le k$, then $A_j$ and $A_k$ are
independent. If $j < k < j + r_j$, then
$P(A_j | A_k) \le P[d_j = \cdots = d_{k-1} = 0 \,|\, A_k] = P[d_j = \cdots = d_{k-1} = 0] = 1/2^{k-j}$,
and so $P(A_j \cap A_k) \le P(A_k)/2^{k-j}$. Therefore,

    $\sum_{j,k \le n} P(A_j \cap A_k) \le \sum_{k \le n} P(A_k) + 2 \sum_{j<k\le n,\ j+r_j \le k} P(A_j)P(A_k) + 2 \sum_{j<k\le n,\ k<j+r_j} 2^{-(k-j)} P(A_k)$
        $\le \sum_{k \le n} P(A_k) + \left(\sum_{k \le n} P(A_k)\right)^2 + 2 \sum_{k \le n} P(A_k).$

If $\sum P(A_n) = \sum 2^{-r_n}$ diverges, then (6.7) follows.

Thus $\{r_n\}$ is an outer or an inner boundary according as $\sum 2^{-r_n}$ converges or
diverges, which completely settles the issue. In particular, $r_n = \log_2 n + \theta \log_2 \log_2 n$
gives an outer boundary for $\theta > 1$ and an inner boundary for
$\theta \le 1$.


Example 6.6. It is now possible to complete the analysis in Examples 4.12 and
4.16 of the relative error $e_n(\omega)$ in the approximation of $\omega$ by $\sum_{k=1}^{n-1} d_k(\omega) 2^{-k}$. If
$l_n(\omega) \ge -\log_2 x_n$ ($0 < x_n < 1$), then $e_n(\omega) \le x_n$ by (4.22). By the preceding example
for the case $r_n = -\log_2 x_n$, $\sum x_n = \infty$ implies that $P[\omega: e_n(\omega) \le x_n \text{ i.o.}] = 1$. By this
and Example 4.12, $[\omega: e_n(\omega) \le x_n \text{ i.o.}]$ has Lebesgue measure 0 or 1 according as $\sum x_n$
converges or diverges.

PROBLEMS 


6.1. Show that $Z_n \to Z$ with probability 1 if and only if for every positive $\epsilon$ there
exists an $n$ such that $P[|Z_k - Z| < \epsilon,\ n \le k \le m] > 1 - \epsilon$ for all $m$ exceeding $n$.
This describes convergence with probability 1 in “finite” terms.

6.2. Show in Example 6.3 that $P[|S_n - L_n| \ge L_n^{1/2+\epsilon}] \to 0$.

6.3. As in Examples 5.6 and 6.3, let $\omega$ be a random permutation of $1, 2, \ldots, n$. Each
$k$, $1 \le k \le n$, occupies some position in the bottom row of the permutation $\omega$;
let $X_{nk}(\omega)$ be the number of smaller elements (between 1 and $k - 1$) lying to
the right of $k$ in the bottom row. The sum $S_n = X_{n1} + \cdots + X_{nn}$ is the total
number of inversions, that is, the number of pairs appearing in the bottom row in
reverse order of size. For the permutation in Example 5.6 the values of
$X_{71}, \ldots, X_{77}$ are 0, 0, 0, 2, 4, 2, 4, and $S_7 = 12$. Show that $X_{n1}, \ldots, X_{nn}$ are
independent and $P[X_{nk} = i] = k^{-1}$ for $0 \le i < k$. Calculate $E[S_n]$ and $\mathrm{Var}[S_n]$.
Show that $S_n$ is likely to be near $n^2/4$.

6.4. For a function $f$ on [0,1] write $\|f\| = \sup_x |f(x)|$. Show that, if $f$ has a
continuous derivative $f'$, then $\|f - B_n\| \le \epsilon\|f'\| + 2\|f\|/n\epsilon^2$. Conclude that
$\|f - B_n\| = O(n^{-1/3})$.

6.5. Prove Poisson’s theorem: If $A_1, A_2, \ldots$ are independent events,
$\bar{p}_n = n^{-1} \sum_{i=1}^{n} P(A_i)$, and $N_n = \sum_{i=1}^{n} I_{A_i}$, then $n^{-1}N_n - \bar{p}_n \to_P 0$.

In the following problems $S_n = X_1 + \cdots + X_n$.

6.6. Prove Cantelli’s theorem: If $X_1, X_2, \ldots$ are independent, $E[X_n] = 0$, and $E[X_n^4]$
is bounded, then $n^{-1}S_n \to 0$ with probability 1. The $X_n$ need not be identically
distributed.

6.7. (a) Let $x_1, x_2, \ldots$ be a sequence of real numbers, and put $s_n = x_1 + \cdots + x_n$.
Suppose that $n^{-2}s_{n^2} \to 0$ and that the $x_n$ are bounded, and show that $n^{-1}s_n \to 0$.
(b) Suppose that $n^{-2}S_{n^2} \to 0$ with probability 1 and that the $X_n$ are uniformly
bounded ($\sup_{n,\omega} |X_n(\omega)| < \infty$). Show that $n^{-1}S_n \to 0$ with probability 1. Here
the $X_n$ need not be identically distributed or even independent.

6.8. ↑ Suppose that $X_1, X_2, \ldots$ are independent and uniformly bounded and
$E[X_n] = 0$. Using only the preceding result, the first Borel-Cantelli lemma, and
Chebyshev’s inequality, prove that $n^{-1}S_n \to 0$ with probability 1.

6.9. ↑ Use the ideas of Problem 6.8 to give a new proof of Borel’s normal number
theorem, Theorem 1.2. The point is to return to first principles and use only
negligibility and the other ideas of Section 1, not the apparatus of Sections 2
through 6; in particular, $P(A)$ is to be taken as defined only if $A$ is a finite,
disjoint union of intervals.

6.10. 5.11 6.7↑ Suppose that (in the notation of (5.41)) $\beta_n - \alpha_n^2 = O(1/n)$. Show
that $n^{-1}N_n - \alpha_n \to 0$ with probability 1. What condition on $\beta_n - \alpha_n^2$ will imply a
weak law? Note that independence is not assumed here.

6.11. Suppose that $X_1, X_2, \ldots$ are $m$-dependent in the sense that random variables
more than $m$ apart in the sequence are independent. More precisely, let
$\mathcal{A}_j^k = \sigma(X_j, \ldots, X_k)$, and assume that $\mathcal{A}_{j_1}^{k_1}, \ldots, \mathcal{A}_{j_l}^{k_l}$ are independent if
$k_{i-1} + m < j_i$ for $i = 2, \ldots, l$. (Independent random variables are 0-dependent.)
Suppose that the $X_n$ have this property and are uniformly bounded and that
$E[X_n] = 0$. Show that $n^{-1}S_n \to 0$. Hint: Consider the subsequences
$X_i, X_{i+m+1}, X_{i+2(m+1)}, \ldots$ for $1 \le i \le m + 1$.

6.12. ↑ Suppose that the $X_n$ are independent and assume the values $x_1, \ldots, x_l$ with
probabilities $p(x_1), \ldots, p(x_l)$. For $u_1, \ldots, u_k$ a $k$-tuple of the $x_i$’s, let
$N_n(u_1, \ldots, u_k)$ be the frequency of the $k$-tuple in the first $n + k - 1$ trials, that
is, the number of $t$ such that $1 \le t \le n$ and $X_t = u_1, \ldots, X_{t+k-1} = u_k$. Show that
with probability 1, all asymptotic relative frequencies are what they should
be; that is, with probability 1, $n^{-1}N_n(u_1, \ldots, u_k) \to p(u_1) \cdots p(u_k)$ for every $k$
and every $k$-tuple $u_1, \ldots, u_k$.

6.13. ↑ A number $\omega$ in the unit interval is completely normal if, for every base $b$
and every $k$ and every $k$-tuple of base-$b$ digits, the $k$-tuple appears in the base-$b$
expansion of $\omega$ with asymptotic relative frequency $b^{-k}$. Show that the set of
completely normal numbers has Lebesgue measure 1.

6.14. Shannon’s theorem. Suppose that $X_1, X_2, \ldots$ are independent, identically
distributed random variables taking on the values $1, \ldots, r$ with positive
probabilities $p_1, \ldots, p_r$. If $p_n(i_1, \ldots, i_n) = p_{i_1} \cdots p_{i_n}$ and
$p_n(\omega) = p_n(X_1(\omega), \ldots, X_n(\omega))$,
then $p_n(\omega)$ is the probability that a new sequence of $n$ trials would produce the
particular sequence $X_1(\omega), \ldots, X_n(\omega)$ of outcomes that happens actually to have
been observed. Show that

    $-\frac{1}{n} \log p_n(\omega) \to h = -\sum_{i=1}^{r} p_i \log p_i$

with probability 1.

In information theory $1, \ldots, r$ are interpreted as the letters of an alphabet,
$X_1, X_2, \ldots$ are the successive letters produced by an information source, and $h$
is the entropy of the source. Prove the asymptotic equipartition property: For
large $n$ there is probability exceeding $1 - \epsilon$ that the probability $p_n(\omega)$ of the
observed $n$-long sequence, or message, is in the range $e^{-n(h \pm \epsilon)}$.

6.15. In the terminology of Example 6.5, show that $\log_2 n + \log_2 \log_2 n + \theta \log_2 \log_2 \log_2 n$
is an outer or inner boundary as $\theta > 1$ or $\theta \le 1$. Generalize.
(Compare Problem 4.12.)

6.16. 5.20↑ Let $g(m) = \sum_p \delta_p(m)$ be the number of distinct prime divisors of $m$. For
$a_n = E_n[g]$ (see (5.46)) show that $a_n \to \infty$. Show that

(6.8)    $E_n\left[\left(\delta_p - \frac{1}{n}\left\lfloor \frac{n}{p} \right\rfloor\right)\left(\delta_q - \frac{1}{n}\left\lfloor \frac{n}{q} \right\rfloor\right)\right] \le \frac{1}{np} + \frac{1}{nq}$

for $p \ne q$ and hence that the variance of $g$ under $P_n$ satisfies

(6.9)    $\mathrm{Var}_n[g] \le 3 \sum_{p \le n} \frac{1}{p}.$

Prove the Hardy-Ramanujan theorem:

(6.10)    $\lim_n P_n\left[m: \left|\frac{g(m)}{a_n} - 1\right| \ge \epsilon\right] = 0.$

Since $a_n \sim \log\log n$ (see Problem 18.17), most integers under $n$ have something
like $\log\log n$ distinct prime divisors. Since $\log\log 10^7$ is a little less than 3, the
typical integer under $10^7$ has about three prime factors, which is remarkably few.

6.17. Suppose that $X_1, X_2, \ldots$ are independent and $P[X_k = 0] = p$. Let $L_n$ be the
length of the run of 0’s starting at the $n$th place: $L_n = k$ if
$X_n = \cdots = X_{n+k-1} = 0 \ne X_{n+k}$. Show that $P[L_n \ge r_n \text{ i.o.}]$ is 0 or 1 according as
$\sum_n p^{r_n}$ converges or diverges. Example 6.5 covers the case $p = 1/2$.


SECTION 7. GAMBLING SYSTEMS 


Let $X_1, X_2, \ldots$ be an independent sequence of random variables (on some
$(\Omega, \mathcal{F}, P)$) taking on the two values +1 and -1 with probabilities
$P[X_n = +1] = p$ and $P[X_n = -1] = q = 1 - p$. Throughout the section, $X_n$ will be
viewed as the gambler’s gain on the $n$th of a series of plays at unit stakes.
The game is favorable to the gambler if $p > 1/2$, fair if $p = 1/2$, and unfavorable
if $p < 1/2$. The case $p \le 1/2$ will be called the subfair case.

After the classical gambler’s ruin problem has been solved, it will be 
shown that every gambling system is in certain respects without effect and 
that some gambling systems are in other respects optimal. Gambling prob- 
lems of the sort considered here have inspired many ideas in the mathemati- 
cal theory of probability, ideas that carry far beyond their origin. 

Red-and-black will provide numerical examples. Of the 38 spaces on a 
roulette wheel, 18 are red, 18 are black, and 2 are green. In betting either on 
red or on black the chance of winning is 18/38.


Gambler’s Ruin 


Suppose that the gambler enters the casino with capital a and adopts the 
strategy of continuing to bet at unit stakes until his fortune increases to c or 
his funds are exhausted. What is the probability of ruin, the probability that 
he will lose his capital, a? What is the probability he will achieve his goal, c? 
Here a and c are integers. 

Let

(7.1)    $S_n = X_1 + \cdots + X_n, \qquad S_0 = 0.$

The gambler’s fortune after $n$ plays is $a + S_n$. The event

(7.2)    $A_{a,n} = [a + S_n = c] \cap \bigcap_{k=1}^{n-1} [0 < a + S_k < c]$

represents success for the gambler at time $n$, and

(7.3)    $B_{a,n} = [a + S_n = 0] \cap \bigcap_{k=1}^{n-1} [0 < a + S_k < c]$

represents ruin at time $n$. If $s_c(a)$ denotes the probability of ultimate success,
then

(7.4)    $s_c(a) = P\left[\bigcup_{n=1}^{\infty} A_{a,n}\right] = \sum_{n=1}^{\infty} P(A_{a,n})$

for $0 < a < c$.

Fix $c$ and let $a$ vary. For $n \ge 1$ and $0 < a < c$, define $A_{a,n}$ by (7.2), and
adopt the conventions $A_{a,0} = \emptyset$ for $0 \le a < c$ and $A_{c,0} = \Omega$ (success is
impossible at time 0 if $a < c$ and certain if $a = c$), as well as $A_{0,n} = A_{c,n} = \emptyset$
for $n \ge 1$ (play never starts if $a$ is 0 or $c$). By these conventions, $s_c(0) = 0$ and
$s_c(c) = 1$.

Because of independence and the fact that the sequence $X_2, X_3, \ldots$ is a
probabilistic replica of $X_1, X_2, \ldots$, it seems clear that the chance of success
for a gambler with initial fortune $a$ must be the chance of winning the first
wager times the chance of success for an initial fortune $a + 1$, plus the
chance of losing the first wager times the chance of success for an initial
fortune $a - 1$. It thus seems intuitively clear that

(7.5)    $s_c(a) = p\,s_c(a + 1) + q\,s_c(a - 1), \qquad 0 < a < c.$

For a rigorous argument, define $A'_{a,n}$ just as $A_{a,n}$ but with
$S'_n = X_2 + \cdots + X_{n+1}$ in place of $S_n$ in (7.2). Now
$P[X_i = x_i,\ i \le n] = P[X_{i+1} = x_i,\ i \le n]$ for each sequence $x_1, \ldots, x_n$ of +1’s and -1’s,
and therefore $P[(X_1, \ldots, X_n) \in H] = P[(X_2, \ldots, X_{n+1}) \in H]$ for $H \subset R^n$. Take $H$
to be the set of $x = (x_1, \ldots, x_n)$ in $R^n$ satisfying $x_i = \pm 1$, $a + x_1 + \cdots + x_n = c$, and
$0 < a + x_1 + \cdots + x_k < c$ for $k < n$. It follows then that

(7.6)    $P(A_{a,n}) = P(A'_{a,n}).$

Moreover, $A_{a,n} = ([X_1 = +1] \cap A'_{a+1,n-1}) \cup ([X_1 = -1] \cap A'_{a-1,n-1})$ for $n \ge 1$
and $0 < a < c$. By independence and (7.6),
$P(A_{a,n}) = p P(A_{a+1,n-1}) + q P(A_{a-1,n-1})$; adding over $n$ now gives (7.5). Note that
this argument involves the entire infinite sequence $X_1, X_2, \ldots$.

It remains to solve the difference equation (7.5) with the side conditions
$s_c(0) = 0$, $s_c(c) = 1$. Let $\rho = q/p$ be the odds against the gambler. Then [A19]
there exist constants $A$ and $B$ such that, for $0 \le a \le c$, $s_c(a) = A + B\rho^a$ if
$p \ne q$ and $s_c(a) = A + Ba$ if $p = q$. The requirements $s_c(0) = 0$ and $s_c(c) = 1$
determine $A$ and $B$, which gives the solution:

The probability that the gambler can before ruin attain his goal of c from an
initial capital of a is

(7.7)    $s_c(a) = \frac{\rho^a - 1}{\rho^c - 1}, \quad 0 \le a \le c, \quad \text{if } \rho = \frac{q}{p} \ne 1;$
         $s_c(a) = \frac{a}{c}, \quad 0 \le a \le c, \quad \text{if } \rho = \frac{q}{p} = 1.$

Example 7.1. The gambler’s initial capital is $900 and his goal is $1000. If
$p = 1/2$, his chance of success is very good: $s_{1000}(900) = .9$. At red-and-black,
$p = 18/38$ and hence $\rho = 20/18$; in this case his chance of success as computed by
(7.7) is only about .00003.


Example 7.2. It is the gambler’s desperate intention to convert his $100
into $20,000. For a game in which $p = 1/2$ (no casino has one), his chance of
success is 100/20,000 = .005; at red-and-black it is minute, about $3 \times 10^{-911}$.


In the analysis leading to (7.7), replace (7.2) by (7.3). It follows that (7.7)
with $p$ and $q$ interchanged ($\rho$ goes to $\rho^{-1}$) and $a$ and $c - a$ interchanged
gives the probability $r_c(a)$ of ruin for the gambler:
$r_c(a) = (\rho^{-(c-a)} - 1)/(\rho^{-c} - 1)$ if $\rho \ne 1$ and $r_c(a) = (c - a)/c$ if $\rho = 1$. Hence
$s_c(a) + r_c(a) = 1$ holds in all cases: The probability is 0 that play continues forever.
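
The formula (7.7), the ruin probability, and the figures quoted in Examples 7.1 and 7.2 can be reproduced with exact rational arithmetic. The following Python sketch is an illustration only; the function names are hypothetical. Because the value in Example 7.2 is far below the range of floating-point numbers, its order of magnitude is read off from base-10 logarithms of the exact numerator and denominator.

```python
from fractions import Fraction
from math import log10


def success_probability(a, c, p):
    """s_c(a) from (7.7): chance of reaching c before 0 from capital a at unit stakes."""
    p = Fraction(p)
    q = 1 - p
    if p == q:
        return Fraction(a, c)
    rho = q / p
    return (rho**a - 1) / (rho**c - 1)


def ruin_probability(a, c, p):
    """r_c(a): formula (7.7) with p and q interchanged and a replaced by c - a."""
    return success_probability(c - a, c, 1 - Fraction(p))


if __name__ == "__main__":
    red_and_black = Fraction(18, 38)
    print(float(success_probability(900, 1000, Fraction(1, 2))))   # fair game, Example 7.1
    print(float(success_probability(900, 1000, red_and_black)))    # red-and-black, Example 7.1
    s = success_probability(100, 20000, red_and_black)             # Example 7.2
    print(log10(s.numerator) - log10(s.denominator))               # base-10 exponent
```

The same functions can be used to confirm that $s_c$ satisfies the difference equation (7.5) and that success and ruin probabilities sum to 1.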

For positive integers $a$ and $b$, let

    $H_{a,b} = \bigcup_{n=1}^{\infty} \left([S_n = b] \cap \bigcap_{k=1}^{n-1} [-a < S_k < b]\right)$

be the event that $S_n$ reaches $+b$ before reaching $-a$. Its probability is simply
(7.7) with $c = a + b$: $P(H_{a,b}) = s_{a+b}(a)$. Now let

    $H_b = \bigcup_{a=1}^{\infty} H_{a,b} = \bigcup_{n=1}^{\infty} [S_n = b] = [\sup_n S_n \ge b]$

be the event that $S_n$ ever reaches $+b$. Since $H_{a,b} \uparrow H_b$ as $a \to \infty$, it follows
that $P(H_b) = \lim_a s_{a+b}(a)$; this is 1 if $\rho = 1$ or $\rho < 1$, and it is $1/\rho^b$ if $\rho > 1$.
Thus

(7.8)    $P[\sup_n S_n \ge b] = 1 \quad \text{if } p \ge q;$
         $P[\sup_n S_n \ge b] = (p/q)^b \quad \text{if } p < q.$

This is the probability that a gambler with unlimited capital can ultimately gain
b units.


Example 7.3. The gambler in Example 7.1 has capital 900 and the goal of
winning $b = 100$; in Example 7.2 he has capital 100 and $b$ is 19,900. Suppose,
instead, that his capital is infinite. If $p = 1/2$, the chance of achieving his goal
increases from .9 to 1 in the first example and from .005 to 1 in the second.
At red-and-black, however, the two probabilities $.9^{100}$ and $.9^{19900}$ remain
essentially what they were before (.00003 and $3 \times 10^{-911}$).

Selection Systems


Players often try to improve their luck by betting only when in the preceding
trials the wins and losses form an auspicious pattern. Perhaps the gambler
bets on the $n$th trial only when among $X_1, \ldots, X_{n-1}$ there are many more
+1’s than -1’s, the idea being to ride winning streaks (he is “in the vein”).
Or he may bet only when there are many more -1’s than +1’s, the idea
being it is then surely time a +1 came along (the “maturity of the chances”).
There is a mathematical theorem that, translated into gaming language, says
all such systems are futile.


It might be argued that it is sensible to bet if among $X_1, \ldots, X_{n-1}$ there is an
excess of +1’s, on the ground that it is evidence of a high value of $p$. But it is
assumed throughout that statistical inference is not at issue: $p$ is fixed (at 18/38, for
example, in the case of red-and-black) and is known to the gambler, or should be.


The gambler’s strategy is described by random variables $B_1, B_2, \ldots$ taking
the two values 0 and 1: If $B_n = 1$, the gambler places a bet on the $n$th trial; if
$B_n = 0$, he skips that trial. If $B_n$ were $(X_n + 1)/2$, so that $B_n = 1$ for $X_n = +1$
and $B_n = 0$ for $X_n = -1$, the gambler would win every time he bet, but of
course such a system requires that he be prescient: he must know the outcome
$X_n$ in advance. For this reason the value of $B_n$ is assumed to depend only on
the values of $X_1, \ldots, X_{n-1}$: there exists some function $b_n: R^{n-1} \to R^1$ such
that

(7.9)    $B_n = b_n(X_1, \ldots, X_{n-1}).$

(Here $B_1$ is constant.) Thus the mathematics avoids, as it must, the question
of whether prescience is actually possible.
Define

(7.10)    $\mathcal{F}_n = \sigma(X_1, \ldots, X_n), \quad n = 1, 2, \ldots; \qquad \mathcal{F}_0 = \{\emptyset, \Omega\}.$

The $\sigma$-field $\mathcal{F}_{n-1}$ generated by $X_1, \ldots, X_{n-1}$ corresponds to a knowledge of
the outcomes of the first $n - 1$ trials. The requirement (7.9) ensures that $B_n$
is measurable $\mathcal{F}_{n-1}$ (Theorem 5.1) and so depends only on the information
actually available to the gambler just before the $n$th trial.

For $n = 1, 2, \ldots$, let $N_n$ be the time at which the gambler places his $n$th
bet. This $n$th bet is placed at time $k$ or earlier if and only if the number
$\sum_{i=1}^{k} B_i$ of bets placed up to and including time $k$ is $n$ or more; in fact,
$N_n$ is the smallest $k$ for which $\sum_{i=1}^{k} B_i = n$. Thus the event $[N_n \le k]$ coincides
with $[\sum_{i=1}^{k} B_i \ge n]$; by (7.9) this latter event lies in
$\sigma(B_1, \ldots, B_k) \subset \sigma(X_1, \ldots, X_{k-1}) = \mathcal{F}_{k-1}$. Therefore,

(7.11)    $[N_n = k] = [N_n \le k] - [N_n \le k - 1] \in \mathcal{F}_{k-1}.$

(Even though $[N_n = k]$ lies in $\mathcal{F}_{k-1}$ and hence in $\mathcal{F}$, $N_n$ is, as a function on
$\Omega$, generally not a simple random variable, because it has infinite range. This
makes no difference, because expected values of the $N_n$ will play no role;
(7.11) is the essential property.)

To ensure that play continues forever (stopping rules will be considered
later) and that the $N_n$ have finite values with probability 1, make the further
assumption that

(7.12)    $P[B_n = 1 \text{ i.o.}] = 1.$

A sequence $\{B_n\}$ of random variables assuming the values 0 and 1, having the
form (7.9), and satisfying (7.12) is a selection system.

Let $Y_n$ be the gambler’s gain on the $n$th of the trials at which he does bet:
$Y_n = X_{N_n}$. It is only on the set $[B_n = 1 \text{ i.o.}]$ that all the $N_n$ and hence all the $Y_n$
are well defined. To complete the definition, set $Y_n = -1$, say, on
$[B_n = 1 \text{ i.o.}]^c$; since this set has probability 0 by (7.12), it really makes no difference
how $Y_n$ is defined on it.

Now $Y_n$ is a complicated function on $\Omega$ because $Y_n(\omega) = X_{N_n(\omega)}(\omega)$.
Nonetheless,

    $[\omega: Y_n(\omega) = 1] = \bigcup_{k=1}^{\infty} ([\omega: N_n(\omega) = k] \cap [\omega: X_k(\omega) = 1])$

lies in $\mathcal{F}$, and so does its complement $[\omega: Y_n(\omega) = -1]$. Hence $Y_n$ is a simple
random variable.


Example 7.4. An example will fix these ideas. Suppose that the rule is
always to bet on the first trial, to bet on the second trial if and only if
$X_1 = +1$, to bet on the third trial if and only if $X_1 = X_2$, and to bet on all
subsequent trials. Here $B_1 = 1$, $[B_2 = 1] = [X_1 = +1]$, $[B_3 = 1] = [X_1 = X_2]$,
and $B_4 = B_5 = \cdots = 1$. The table shows the ways the gambling can start out.
A dot represents a value undetermined by $X_1, X_2, X_3$. Ignore the rightmost
column for the moment.

    X_1  X_2  X_3 | B_1  B_2  B_3 | N_1  N_2  N_3  N_4 | Y_1  Y_2  Y_3 | tau
    -1   -1   -1  |  1    0    1  |  1    3    4    5  | -1   -1    .  |  1
    -1   -1   +1  |  1    0    1  |  1    3    4    5  | -1   +1    .  |  1
    -1   +1   -1  |  1    0    0  |  1    4    5    6  | -1    .    .  |  1
    -1   +1   +1  |  1    0    0  |  1    4    5    6  | -1    .    .  |  1
    +1   -1   -1  |  1    1    0  |  1    2    4    5  | +1   -1    .  |  2
    +1   -1   +1  |  1    1    0  |  1    2    4    5  | +1   -1    .  |  2
    +1   +1   -1  |  1    1    1  |  1    2    3    4  | +1   +1   -1  |  3
    +1   +1   +1  |  1    1    1  |  1    2    3    4  | +1   +1   +1  |  .

In the evolution represented by the first line of the table, the second bet is
placed on the third trial ($N_2 = 3$), which results in a loss because
$Y_2 = X_{N_2} = X_3 = -1$. Since $X_3 = -1$, the gambler was “wrong” to bet. But remember
that before the third trial he does not know $X_3(\omega)$ (much less $\omega$ itself); he
knows only $X_1(\omega)$ and $X_2(\omega)$. See the discussion in Example 5.5.


Selection systems achieve nothing because $\{Y_n\}$ has the same structure as
$\{X_n\}$:


Theorem 7.1. For every selection system, $\{Y_n\}$ is independent and
$P[Y_n = +1] = p$, $P[Y_n = -1] = q$.


Proof. Since random variables with indices that are themselves random
variables are conceptually confusing at first, the $\omega$’s here will not be
suppressed as they have been in previous proofs.

Relabel $p$ and $q$ as $p(+1)$ and $p(-1)$, so that $P[\omega: X_k(\omega) = x] = p(x)$ for
$x = \pm 1$. If $A \in \mathcal{F}_{k-1}$, then $A$ and $[\omega: X_k(\omega) = x]$ are independent, and so
$P(A \cap [\omega: X_k(\omega) = x]) = P(A)p(x)$. Therefore, by (7.11),

    $P[\omega: Y_n(\omega) = x] = P[\omega: X_{N_n(\omega)}(\omega) = x]$
        $= \sum_{k=1}^{\infty} P[\omega: N_n(\omega) = k,\ X_k(\omega) = x]$
        $= \sum_{k=1}^{\infty} P[\omega: N_n(\omega) = k]\,p(x)$
        $= p(x).$

More generally, for any sequence $x_1, \ldots, x_n$ of $\pm 1$’s,

    $P[\omega: Y_i(\omega) = x_i,\ i \le n] = P[\omega: X_{N_i(\omega)}(\omega) = x_i,\ i \le n]$
        $= \sum_{k_1 < \cdots < k_n} P[\omega: N_i(\omega) = k_i,\ X_{k_i}(\omega) = x_i,\ i \le n],$

where the sum extends over $n$-tuples of positive integers satisfying $k_1 < \cdots < k_n$.
The event $[\omega: N_i(\omega) = k_i,\ i \le n] \cap [\omega: X_{k_i}(\omega) = x_i,\ i < n]$ lies in $\mathcal{F}_{k_n - 1}$
(note that there is no condition on $X_{k_n}(\omega)$), and therefore

    $P[\omega: Y_i(\omega) = x_i,\ i \le n]$
        $= \sum_{k_1 < \cdots < k_n} P([\omega: N_i(\omega) = k_i,\ i \le n] \cap [\omega: X_{k_i}(\omega) = x_i,\ i < n])\,p(x_n).$

Summing $k_n$ over $k_{n-1} + 1, k_{n-1} + 2, \ldots$ brings this last sum to

    $\sum_{k_1 < \cdots < k_{n-1}} P[\omega: N_i(\omega) = k_i,\ X_{k_i}(\omega) = x_i,\ i < n]\,p(x_n)$
        $= P[\omega: X_{N_i(\omega)}(\omega) = x_i,\ i < n]\,p(x_n)$
        $= P[\omega: Y_i(\omega) = x_i,\ i < n]\,p(x_n).$

It follows by induction that

    $P[\omega: Y_i(\omega) = x_i,\ i \le n] = \prod_{i \le n} p(x_i) = \prod_{i \le n} P[\omega: Y_i(\omega) = x_i],$

and so the $Y_n$ are independent (see (5.9)).
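
Theorem 7.1 can be checked by exhaustive enumeration for a concrete selection system. The Python sketch below is an illustration and is not part of the proof; it encodes the rule of Example 7.4 (bet on trial 1, on trial 2 if and only if the first outcome is +1, on trial 3 if and only if the first two outcomes agree, and on every later trial) and computes the exact joint law of the first three selected gains. The function names and the horizon of five trials are assumptions for the illustration; under this rule the third bet is always placed by trial 5, so five trials suffice.

```python
from fractions import Fraction
from itertools import product


def bets(xs):
    """B_1, ..., B_len(xs) for the rule of Example 7.4; B_n looks only at xs[:n-1]."""
    out = []
    for n in range(1, len(xs) + 1):
        if n == 1:
            out.append(1)
        elif n == 2:
            out.append(1 if xs[0] == +1 else 0)
        elif n == 3:
            out.append(1 if xs[0] == xs[1] else 0)
        else:
            out.append(1)
    return out


def selected_gains(xs, how_many):
    """Y_1, ..., Y_how_many: outcomes of the trials on which a bet was placed."""
    ys = [x for x, b in zip(xs, bets(xs)) if b == 1]
    return tuple(ys[:how_many])


def joint_law(p, how_many=3, horizon=5):
    """Exact joint distribution of (Y_1, ..., Y_how_many), enumerating X_1, ..., X_horizon."""
    p = Fraction(p)
    weight = {+1: p, -1: 1 - p}
    law = {}
    for xs in product((+1, -1), repeat=horizon):
        prob = Fraction(1)
        for x in xs:
            prob *= weight[x]
        ys = selected_gains(xs, how_many)
        law[ys] = law.get(ys, Fraction(0)) + prob
    return law


if __name__ == "__main__":
    for ys, prob in sorted(joint_law(Fraction(18, 38)).items()):
        print(ys, prob)
```

Each of the eight outcomes $(y_1, y_2, y_3)$ should receive probability $p(y_1)p(y_2)p(y_3)$, exactly as for three unselected trials.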


Gambling Policies 


There are schemes that go beyond selection systems and tell the gambler not
only whether to bet but how much. Gamblers frequently contrive or adopt
such schemes in the confident expectation that they can, by pure force of
arithmetic, counter the most adverse workings of chance. If the wager
specified for the $n$th trial is in the amount $W_n$ and the gambler cannot see
into the future, then $W_n$ must depend only on $X_1, \ldots, X_{n-1}$. Assume therefore
that $W_n$ is a nonnegative function of these random variables: there is an
$f_n: R^{n-1} \to R^1$ such that

(7.13)    $W_n = f_n(X_1, \ldots, X_{n-1}) \ge 0.$

Apart from nonnegativity there are at the outset no constraints on the $f_n$,
although in an actual casino their values must be integral multiples of a basic
unit. Such a sequence $\{W_n\}$ is a betting system. Since $W_n = 0$ corresponds to a
decision not to bet at all, betting systems in effect include selection systems.
In the double-or-nothing system, $W_n = 2^{n-1}$ if $X_1 = \cdots = X_{n-1} = -1$ ($W_1 = 1$)
and $W_n = 0$ otherwise.


The amount the gambler wins on the $n$th play is $W_n X_n$. If his fortune at
time $n$ is $F_n$, then

(7.14)    $F_n = F_{n-1} + W_n X_n.$


This also holds for $n = 1$ if $F_0$ is taken as his initial (nonrandom) fortune. It
is convenient to let $W_n$ depend on $F_0$ as well as the past history of play and
hence to generalize (7.13) to

(7.15)    $W_n = g_n(F_0, X_1, \ldots, X_{n-1}) \ge 0$

for a function $g_n: R^n \to R^1$. In expanded notation,
$W_n(\omega) = g_n(F_0, X_1(\omega), \ldots, X_{n-1}(\omega))$. The symbol $W_n$ does not show the dependence
on $\omega$ or on $F_0$, either. For each fixed initial fortune $F_0$, $W_n$ is a simple random
variable; by (7.15) it is measurable $\mathcal{F}_{n-1}$. Similarly, $F_n$ is a function of $F_0$ as well as of
$X_1(\omega), \ldots, X_n(\omega)$: $F_n = F_n(F_0, \omega)$.
If $F_0 = 0$ and $g_n \equiv 1$, the $F_n$ reduce to the partial sums (7.1).

Since $\mathcal{F}_{n-1}$ and $\sigma(X_n)$ are independent, and since $W_n$ is measurable
$\mathcal{F}_{n-1}$ (for each fixed $F_0$), $W_n$ and $X_n$ are independent. Therefore,
$E[W_n X_n] = E[W_n] \cdot E[X_n]$. Now $E[X_n] = p - q \le 0$ in the subfair case ($p \le 1/2$), with
equality in the fair case ($p = 1/2$). Since $E[W_n] \ge 0$, (7.14) implies that
$E[F_n] \le E[F_{n-1}]$. Therefore,

(7.16)    $F_0 \ge E[F_1] \ge \cdots \ge E[F_n] \ge \cdots$

in the subfair case, and

(7.17)    $F_0 = E[F_1] = \cdots = E[F_n] = \cdots$

in the fair case. (If $p < q$ and $P[W_n > 0] > 0$, there is strict inequality in
(7.16).) Thus no betting system can convert a subfair game into a profitable
enterprise.
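
The monotonicity of expected fortunes can be seen concretely for the double-or-nothing system. The Python sketch below is an illustration under stated assumptions: the wager rule is the double-or-nothing rule described in the text, the function names are hypothetical, and expected values are computed exactly by enumerating all outcome sequences up to a short horizon.

```python
from fractions import Fraction
from itertools import product


def double_or_nothing(history):
    """Wager W_n given X_1, ..., X_{n-1}: 2^(n-1) while every earlier play was lost, else 0."""
    return 2 ** len(history) if all(x == -1 for x in history) else 0


def expected_fortunes(p, initial, horizon, wager=double_or_nothing):
    """Exact E[F_0], ..., E[F_horizon] under the recursion F_n = F_{n-1} + W_n X_n."""
    p = Fraction(p)
    weight = {+1: p, -1: 1 - p}
    means = [Fraction(0)] * (horizon + 1)
    for xs in product((+1, -1), repeat=horizon):
        prob = Fraction(1)
        for x in xs:
            prob *= weight[x]
        fortune = Fraction(initial)
        means[0] += prob * fortune
        for n in range(1, horizon + 1):
            # The wager for trial n may depend only on the first n - 1 outcomes.
            fortune += wager(xs[: n - 1]) * xs[n - 1]
            means[n] += prob * fortune
    return means


if __name__ == "__main__":
    print([float(m) for m in expected_fortunes(Fraction(18, 38), 100, 6)])  # decreasing
    print([float(m) for m in expected_fortunes(Fraction(1, 2), 100, 6)])    # constant
```

For this system $E[W_n] = (2q)^{n-1}$, so each step changes the expected fortune by $(2q)^{n-1}(p - q)$, which is negative at red-and-black and zero in the fair case.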

Suppose that in addition to a betting system, the gambler adopts some 
policy for quitting. Perhaps he stops when his fortune reaches a set target, or 
his funds are exhausted, or the auguries are in some way dissuasive. The 
decision to stop must depend only on the initial fortune and the history of 
play up to the present. 

Let $\tau(F_0, \omega)$ be a nonnegative integer for each $\omega$ in $\Omega$ and each $F_0 \ge 0$. If
$\tau = n$, the gambler plays on the $n$th trial (betting $W_n$) and then stops; if $\tau = 0$,
he does not begin gambling in the first place. The event $[\omega: \tau(F_0, \omega) = n]$
represents the decision to stop just after the $n$th trial, and so, whatever value
$F_0$ may have, it must depend only on $X_1, \ldots, X_n$. Therefore, assume that

(7.18)    $[\omega: \tau(F_0, \omega) = n] \in \mathcal{F}_n, \qquad n = 0, 1, 2, \ldots.$

A $\tau$ satisfying this requirement is a stopping time. (In general it has infinite
range and hence is not a simple random variable; as expected values of $\tau$
play no role here, this does not matter.) It is technically necessary to let
$\tau(F_0, \omega)$ be undefined or infinite on an $\omega$-set of probability 0. This has no
effect on the requirement (7.18), which must hold for each finite $n$. But it is
assumed that $\tau$ is finite with probability 1: play is certain to terminate.

A betting system together with a stopping time is a gambling policy. Let $\pi$
denote such a policy.


Example 7.5. Suppose that the betting system is given by $W_n = B_n$, with
$B_n$ as in Example 7.4. Suppose that the stopping rule is to quit after the first
loss of a wager. Then
$[\tau = n] = \bigcup_{k=1}^{n} [N_k = n,\ Y_1 = \cdots = Y_{k-1} = +1,\ Y_k = -1]$.
For $j \le k \le n$, $[N_k = n,\ Y_j = x] = \bigcup_{m=1}^{n} [N_k = n,\ N_j = m,\ X_m = x]$ lies in
$\mathcal{F}_n$ by (7.11); hence $\tau$ is a stopping time. The values of $\tau$ are shown in the
rightmost column of the table.


The sequence of fortunes is governed by (7.14) until play terminates, and
then the fortune remains for all future time fixed at $F_\tau$ (with value $F_{\tau(F_0,\omega)}(\omega)$).
Therefore, the gambler’s fortune at time $n$ is

(7.19)    $F_n^* = F_n \quad \text{if } \tau \ge n;$
          $F_n^* = F_\tau \quad \text{if } \tau \le n.$

Note that the case $\tau = n$ is covered by both clauses here. If $n - 1 < n \le \tau$,
then $F_n^* = F_n = F_{n-1} + W_n X_n = F_{n-1}^* + W_n X_n$; if $\tau \le n - 1 < n$, then
$F_n^* = F_\tau = F_{n-1}^*$. Therefore, if $W_n^* = I_{[\tau \ge n]} W_n$, then

(7.20)    $F_n^* = F_{n-1}^* + I_{[\tau \ge n]} W_n X_n = F_{n-1}^* + W_n^* X_n.$

But this is the equation for a new betting system in which the wager placed
at time $n$ is $W_n^*$. If $\tau \ge n$ (play has not already terminated), $W_n^*$ is the old
amount $W_n$; if $\tau < n$ (play has terminated), $W_n^*$ is 0. Now by (7.18),
$[\tau \ge n] = [\tau < n]^c$ lies in $\mathcal{F}_{n-1}$. Thus $I_{[\tau \ge n]}$ is measurable $\mathcal{F}_{n-1}$, so that $W_n^*$ as well as
$W_n$ is measurable $\mathcal{F}_{n-1}$, and $\{W_n^*\}$ represents a legitimate betting system.
Therefore, (7.16) and (7.17) apply to the new system:

(7.21)    $F_0 = F_0^* \ge E[F_1^*] \ge \cdots \ge E[F_n^*] \ge \cdots$

if $p \le 1/2$, and

(7.22)    $F_0 = F_0^* = E[F_1^*] = \cdots = E[F_n^*] = \cdots$

if $p = 1/2$.

The gambler’s ultimate fortune is $F_\tau$. Now $\lim_n F_n^* = F_\tau$ with probability 1,
since in fact $F_n^* = F_\tau$ for $n \ge \tau$. If

(7.23)    $\lim_n E[F_n^*] = E[F_\tau],$

then (7.21) and (7.22), respectively, imply that $E[F_\tau] \le F_0$ and $E[F_\tau] = F_0$.
According to Theorem 5.4, (7.23) does hold if the $F_n^*$ are uniformly bounded.
Call the policy bounded by $M$ ($M$ nonrandom) if

(7.24)    $0 \le F_n^* \le M, \qquad n = 0, 1, 2, \ldots.$

If $F_n^*$ is not bounded above, the gambler’s adversary must have infinite
capital. A negative $F_n^*$ represents a debt, and if $F_n^*$ is not bounded below,
the gambler must have a patron of infinite wealth and generosity from whom
to borrow and so must in effect have infinite capital. In case $F_n^*$ is bounded
below, 0 is the convenient lower bound: the gambler is assumed to have in
hand all the capital to which he has access. In any real case, (7.24) holds and
(7.23) follows. (There is a technical point that arises because the general
theory of integration has been postponed: $F_\tau$ must be assumed to have finite
range so that it will be a simple random variable and hence have an expected
value in the sense of Section 5.†) The argument has led to this result:


Theorem 7.2. For every policy, (7.21) holds if $p \le 1/2$ and (7.22) holds if
$p = 1/2$. If the policy is bounded (and $F_\tau$ has finite range), then $E[F_\tau] \le F_0$ for
$p \le 1/2$ and $E[F_\tau] = F_0$ for $p = 1/2$.


Example 7.6. The gambler has initial capital $a$ and plays at unit stakes
until his capital increases to $c$ ($0 \le a \le c$) or he is ruined. Here $F_0 = a$ and
$W_n = 1$, and so $F_n = a + S_n$. The policy is bounded by $c$, and $F_\tau$ is $c$ or 0
according as the gambler succeeds or fails. If $p = 1/2$ and if $s$ is the probability
of success, then $a = F_0 = E[F_\tau] = sc$. Thus $s = a/c$. This gives a new derivation
of (7.7) for the case $p = 1/2$. The argument assumes however that play is
certain to terminate. If $p \le 1/2$, Theorem 7.2 only gives $s \le a/c$, which is
weaker than (7.7).


Example 7.7. Suppose as before that $F_0 = a$ and $W_n = 1$, so that $F_n = a + S_n$,
but suppose the stopping rule is to quit as soon as $F_n$ reaches $a + b$. Here
$F_n^*$ is bounded above by $a + b$ but is not bounded below. If $p = 1/2$, the
gambler is by (7.8) certain to achieve his goal, so that $F_\tau = a + b$. In this case
$F_0 = a < a + b = E[F_\tau]$. This illustrates the effect of infinite capital. It also
illustrates the need for uniform boundedness in Theorem 5.4 (compare
Example 5.7).


For some other systems (gamblers call them “martingales”), see the 


problems. For most such systems there is a large chance of a small gain and a 
small chance of a large loss. 


Bold Play * 


The formula (7.7) gives the chance that a gambler betting unit stakes can
increase his fortune from $a$ to $c$ before being ruined. Suppose that $a$ and $c$
happen to be even and that at each trial the wager is two units instead of
one. Since this has the effect of halving $a$ and $c$, the chance of success is now

    $\frac{\rho^{a/2} - 1}{\rho^{c/2} - 1} = \frac{\rho^a - 1}{\rho^c - 1} \cdot \frac{\rho^{c/2} + 1}{\rho^{a/2} + 1}, \qquad \frac{q}{p} = \rho \ne 1.$

If $\rho > 1$ ($p < 1/2$), the second factor on the right exceeds 1: Doubling the
stakes increases the probability of success in the unfavorable case $\rho > 1$. In
the case $\rho = 1$, the probability remains the same.

†See Problem 7.11.
*This topic may be omitted.

There is a sense in which large stakes are optimal. It will be convenient to
rescale so that the initial fortune satisfies $0 \le F_0 \le 1$ and the goal is 1. The
policy of bold play is this: At each stage the gambler bets his entire fortune,
unless a win would carry him past his goal of 1, in which case he bets just
enough that a win would exactly achieve that goal:

(7.25)    $W_n = F_{n-1} \quad \text{if } 0 \le F_{n-1} \le 1/2;$
          $W_n = 1 - F_{n-1} \quad \text{if } 1/2 \le F_{n-1} \le 1.$

(It is convenient to allow even irrational fortunes.) As for stopping, the policy
is to quit as soon as $F_n$ reaches 0 or 1.

Suppose that play has not terminated by time $k - 1$; under the policy
(7.25), if play is not to terminate at time $k$, then $X_k$ must be $+1$ or $-1$
according as $F_{k-1} \le \tfrac12$ or $F_{k-1} \ge \tfrac12$, and the conditional probability of this is
at most $m = \max\{p, q\}$. It follows by induction that the probability that bold
play continues beyond time $n$ is at most $m^n$, and so play is certain to
terminate ($\tau$ is finite with probability 1).

It will be shown that in the subfair case, bold play maximizes the probabil- 
ity of successfully reaching the goal of 1. This is the Dubins—Savage theorem. 
It will further be shown that there are other policies that are also optimal in 
this sense, and this maximum probability will be calculated. Bold play can be 
substantially better than betting at constant stakes. This contrasts with 
Theorems 7.1 and 7.2 concerning respects in which gambling systems are 
worthless. 

From now on, consider only policies $\pi$ that are bounded by 1 (see (7.24)).
Suppose further that play stops as soon as $F_n$ reaches 0 or 1 and that this is
certain eventually to happen. Since $F_\tau$ assumes the values 0 and 1, and since
$[F_\tau = x] = \bigcup_{n=0}^{\infty} [\tau = n] \cap [F_n = x]$ for $x = 0$ and $x = 1$, $F_\tau$ is a simple random
variable. Bold play is one such policy $\pi$.

The policy $\pi$ leads to success if $F_\tau = 1$. Let $Q_\pi(x)$ be the probability of
this for an initial fortune $F_0 = x$:

(7.26)  $Q_\pi(x) = P[F_\tau = 1]$ for $F_0 = x$.

Since $F_n$ is a function $\psi_n(F_0, X_1(\omega), \ldots, X_n(\omega)) = \Psi_n(F_0, \omega)$, (7.26) in
expanded notation is $Q_\pi(x) = P[\omega\colon \Psi_{\tau(\omega)}(x, \omega) = 1]$. As $\pi$ specifies that play
stops at the boundaries 0 and 1,

(7.27)  $Q_\pi(0) = 0, \quad Q_\pi(1) = 1, \quad 0 \le Q_\pi(x) \le 1, \quad 0 \le x \le 1.$

Let $Q$ be the $Q_\pi$ for bold play. (The notation does not show the dependence
of $Q$ and $Q_\pi$ on $p$, which is fixed.)


Theorem 7.3. In the subfair case, $Q_\pi(x) \le Q(x)$ for all $\pi$ and all $x$.


PROOF. Under the assumption $p \le \tfrac12$, it will be shown later that

(7.28)  $Q(x) \ge pQ(x + t) + qQ(x - t), \qquad 0 \le x - t \le x \le x + t \le 1.$

This can be interpreted as saying that the chance of success under bold play
starting at $x$ is at least as great as the chance of success if the amount $t$ is
wagered and bold play then pursued from $x + t$ in case of a win and from
$x - t$ in case of a loss. Under the assumption of (7.28), optimality can be
proved as follows.

Consider a policy $\pi$, and let $F_n$ and $F_n^*$ be the simple random variables
defined by (7.14) and (7.19) for this policy. Now $Q(x)$ is a real function, and
so $Q(F_n^*)$ is also a simple random variable; it can be interpreted as the
conditional chance of success if $\pi$ is replaced by bold play after time $n$. By
(7.20), $F_n^* = x + tX_n$ if $F_{n-1}^* = x$ and $W_n^* = t$. Therefore

$$Q(F_n^*) = \sum_{x,t} I_{[F_{n-1}^* = x,\ W_n^* = t]}\, Q(x + tX_n),$$

where $x$ and $t$ vary over the (finite) ranges of $F_{n-1}^*$ and $W_n^*$, respectively.
For each $x$ and $t$, the indicator above is measurable $\mathcal{F}_{n-1}$ and $Q(x + tX_n)$
is measurable $\sigma(X_n)$; since the $X_n$ are independent, (5.25) and (5.17) give

(7.29)  $E[Q(F_n^*)] = \sum_{x,t} P[F_{n-1}^* = x,\ W_n^* = t]\, E[Q(x + tX_n)].$

By (7.28), $E[Q(x + tX_n)] \le Q(x)$ if $0 \le x - t \le x \le x + t \le 1$. As it is assumed
of $\pi$ that $F_n^*$ lies in $[0, 1]$ (that is, $W_n^* \le \min\{F_{n-1}^*, 1 - F_{n-1}^*\}$), the probability
in (7.29) is 0 unless $x$ and $t$ satisfy this constraint. Therefore,

$$E[Q(F_n^*)] \le \sum_{x,t} P[F_{n-1}^* = x,\ W_n^* = t]\, Q(x) = \sum_{x} P[F_{n-1}^* = x]\, Q(x) = E[Q(F_{n-1}^*)].$$

This is true for each $n$, and so $E[Q(F_n^*)] \le E[Q(F_0^*)] = Q(F_0)$. Since $Q(F_n^*) = Q(F_\tau)$
for $n \ge \tau$, Theorem 5.4 implies that $E[Q(F_\tau)] \le Q(F_0)$. Since $x = 1$
implies that $Q(x) = 1$, $P[F_\tau = 1] \le E[Q(F_\tau)] \le Q(F_0)$. Thus $Q_\pi(F_0) \le Q(F_0)$
for the policy $\pi$, whatever $F_0$ may be.

It remains to analyze $Q$ and prove (7.28). Everything hinges on the
functional equation

(7.30)  $Q(x) = \begin{cases} pQ(2x) & \text{if } 0 \le x \le \tfrac12, \\ p + qQ(2x - 1) & \text{if } \tfrac12 \le x \le 1. \end{cases}$

For $x = 0$ and $x = 1$ this is obvious because $Q(0) = 0$ and $Q(1) = 1$. The idea
is this: Suppose that the initial fortune is $x$. If $x \le \tfrac12$, the first stake under
bold play is $x$; if the gambler is to succeed in reaching 1, he must win the first
trial (probability $p$) and then from his new fortune $x + x = 2x$ go on to
succeed (probability $Q(2x)$); this makes the first half of (7.30) plausible. If
$x \ge \tfrac12$, the first stake is $1 - x$; the gambler can succeed either by winning the
first trial (probability $p$) or by losing the first trial (probability $q$) and then
going on from his new fortune $x - (1 - x) = 2x - 1$ to succeed (probability
$Q(2x - 1)$); this makes the second half of (7.30) plausible.
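
Unrolling (7.30) gives a direct way to evaluate $Q$ numerically. The following is a minimal sketch and not part of the argument: the function name `bold_play_success` is hypothetical, and the loop relies on the fact that a binary floating-point number is a dyadic rational, so the maps $x \mapsto 2x$ and $x \mapsto 2x - 1$ reach 0 or 1 after finitely many exact steps.

```python
def bold_play_success(x, p, max_steps=2000):
    """Evaluate Q(x) for bold play by unrolling the functional equation (7.30).

    Q(x) = p*Q(2x)          for 0 <= x <= 1/2,
    Q(x) = p + q*Q(2x - 1)  for 1/2 <= x <= 1,
    with Q(0) = 0 and Q(1) = 1.  `weight` is the product of the factors
    p and q accumulated so far; `total` collects the additive p terms.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("initial fortune must lie in [0, 1]")
    q = 1.0 - p
    total, weight = 0.0, 1.0
    for _ in range(max_steps):
        if x == 0.0:
            return total
        if x == 1.0:
            return total + weight
        if x <= 0.5:
            weight *= p          # must win the first trial
            x = 2.0 * x
        else:
            total += weight * p  # win at once with probability p
            weight *= q          # or lose and continue from 2x - 1
            x = 2.0 * x - 1.0
    return total  # remaining weight is at most max(p, q)**max_steps


print(bold_play_success(0.5, 0.4))   # Q(1/2) = p
print(bold_play_success(0.75, 0.4))  # Q(3/4) = p + p*q
```

In the fair case $p = \tfrac12$ the same loop simply reads off the binary digits of $x$ and returns $x$ itself.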

It is also intuitively clear that $Q(x)$ must be an increasing function of $x$
($0 \le x \le 1$): the more money the gambler starts with, the better off he is.
Finally, it is intuitively clear that $Q(x)$ ought to be a continuous function of
the initial fortune $x$.


A formal proof of (7.30) can be constructed as for the difference equation (7.5). If
$\beta(x)$ is $x$ for $x \le \tfrac12$ and $1 - x$ for $x \ge \tfrac12$, then under bold play $W_n = \beta(F_{n-1})$. Starting
from $f_0(x) = x$, recursively define

$$f_n(x; x_1, \ldots, x_n) = f_{n-1}(x; x_1, \ldots, x_{n-1}) + \beta(f_{n-1}(x; x_1, \ldots, x_{n-1}))\, x_n.$$

Then $F_n = f_n(F_0; X_1, \ldots, X_n)$. Now define

$$g_n(x; x_1, \ldots, x_n) = \max_{0 \le k \le n} f_k(x; x_1, \ldots, x_k).$$

If $F_0 = x$, then $T_n(x) = [g_n(x; X_1, \ldots, X_n) = 1]$ is the event that bold play will by time
$n$ successfully increase the gambler's fortune to 1. From the recursive definition it
follows by induction on $n$ that for $n \ge 1$, $f_n(x; x_1, \ldots, x_n) = f_{n-1}(x + \beta(x)x_1; x_2, \ldots, x_n)$
and hence that $g_n(x; x_1, \ldots, x_n) = \max\{x, g_{n-1}(x + \beta(x)x_1; x_2, \ldots, x_n)\}$.
Since $x = 1$ implies $g_{n-1}(x + \beta(x)x_1; x_2, \ldots, x_n) = x + \beta(x)x_1 = 1$,
$T_n(x) = [g_{n-1}(x + \beta(x)X_1; X_2, \ldots, X_n) = 1]$, and since the $X_i$ are independent and identically
distributed,

$$P(T_n(x)) = P([X_1 = +1] \cap T_n(x)) + P([X_1 = -1] \cap T_n(x))$$
$$= pP[g_{n-1}(x + \beta(x); X_2, \ldots, X_n) = 1] + qP[g_{n-1}(x - \beta(x); X_2, \ldots, X_n) = 1]$$
$$= pP(T_{n-1}(x + \beta(x))) + qP(T_{n-1}(x - \beta(x))).$$

Letting $n \to \infty$ now gives $Q(x) = pQ(x + \beta(x)) + qQ(x - \beta(x))$,
which reduces to (7.30) because $Q(0) = 0$ and $Q(1) = 1$.

Suppose that $y = f_{n-1}(x; x_1, \ldots, x_{n-1})$ is nondecreasing in $x$. If $x_n = +1$, then
$f_n(x; x_1, \ldots, x_n)$ is $2y$ if $0 \le y \le \tfrac12$ and 1 if $\tfrac12 \le y \le 1$; if $x_n = -1$, then $f_n(x; x_1, \ldots, x_n)$
is 0 if $0 \le y \le \tfrac12$ and $2y - 1$ if $\tfrac12 \le y \le 1$. In any case, $f_n(x; x_1, \ldots, x_n)$ is also
nondecreasing in $x$, and by induction this is true for every $n$. It follows that the same
is true of $g_n(x; x_1, \ldots, x_n)$, of $P(T_n(x))$, and of $Q(x)$. Thus $Q(x)$ is nondecreasing.

Since $Q(1) = 1$, (7.30) implies that $Q(\tfrac12) = pQ(1) = p$, $Q(\tfrac14) = pQ(\tfrac12) = p^2$,
$Q(\tfrac34) = p + qQ(\tfrac12) = p + pq$. More generally, if $p_0 = p$ and $p_1 = q$, then

(7.31)  $Q\!\left(\frac{k}{2^n}\right) = \sum\left[\, p_{u_1} \cdots p_{u_n} \colon \sum_{i=1}^{n} \frac{u_i}{2^i} < \frac{k}{2^n} \right], \qquad 0 \le k \le 2^n,\ n \ge 1,$

the sum extending over $n$-tuples $(u_1, \ldots, u_n)$ of 0's and 1's satisfying the condition
indicated. Indeed, it is easy to see that (7.31) is the same thing as

(7.32)  $Q(.u_1 \ldots u_n + 2^{-n}) - Q(.u_1 \ldots u_n) = p_{u_1} p_{u_2} \cdots p_{u_n}$

for each dyadic rational $.u_1 \ldots u_n$ of rank $n$. If $.u_1 \ldots u_n + 2^{-n} \le \tfrac12$, then $u_1 = 0$ and by
(7.30) the difference in (7.32) is $p_0[Q(.u_2 \ldots u_n + 2^{-n+1}) - Q(.u_2 \ldots u_n)]$. But (7.32)
follows inductively from this and a similar relation for the case $.u_1 \ldots u_n \ge \tfrac12$.

Therefore $Q(k2^{-n}) - Q((k - 1)2^{-n})$ is bounded by $\max\{p^n, q^n\}$, and so by monotonicity
$Q$ is continuous. Since (7.32) is positive, it follows that $Q$ is strictly increasing
over $[0, 1]$.


Thus $Q$ is continuous and increasing and satisfies (7.30). The inequality
(7.28) is still to be proved. It is equivalent to the assertion that

$$\Delta(r, s) = Q(a) - pQ(s) - qQ(r) \ge 0$$

if $0 \le r \le s \le 1$, where $a$ stands for the average: $a = \tfrac12(r + s)$. Since $Q$ is
continuous, it suffices to prove the inequality for $r$ and $s$ of the form $k/2^n$,
and this will be done by induction on $n$. Checking all cases disposes of $n = 0$.
Assume that the inequality holds for a particular $n$, and that $r$ and $s$ have
the form $k/2^{n+1}$. There are four cases to consider.


Case 1. $s \le \tfrac12$. By the first part of (7.30), $\Delta(r, s) = p\Delta(2r, 2s)$. Since $2r$ and
$2s$ have the form $k/2^n$, the induction hypothesis implies that $\Delta(2r, 2s) \ge 0$.

Case 2. $\tfrac12 \le r$. By the second part of (7.30),

$$\Delta(r, s) = q\Delta(2r - 1, 2s - 1) \ge 0.$$

Case 3. $r \le a \le \tfrac12 \le s$. By (7.30),

$$\Delta(r, s) = pQ(2a) - p[\,p + qQ(2s - 1)] - q[\,pQ(2r)].$$

From $\tfrac12 \le s \le r + s = 2a \le 1$, follows $Q(2a) = p + qQ(4a - 1)$; and from
$0 \le 2a - \tfrac12 \le \tfrac12$, follows $Q(2a - \tfrac12) = pQ(4a - 1)$. Therefore, $pQ(2a) = p^2 +
qQ(2a - \tfrac12)$, and it follows that

$$\Delta(r, s) = q[\,Q(2a - \tfrac12) - pQ(2s - 1) - pQ(2r)].$$

Since $p \le q$, the right side does not increase if either of the two $p$'s is
changed to $q$. Hence

$$\Delta(r, s) \ge q \max[\Delta(2r, 2s - 1), \Delta(2s - 1, 2r)].$$

The induction hypothesis applies to $2r \le 2s - 1$ or to $2s - 1 \le 2r$, as the case
may be, so one of the two $\Delta$'s on the right is nonnegative.
Case 4. $r \le \tfrac12 \le a \le s$. By (7.30),

$$\Delta(r, s) = pq + qQ(2a - 1) - pqQ(2s - 1) - pqQ(2r).$$

From $0 \le 2a - 1 = r + s - 1 \le \tfrac12$, follows $Q(2a - 1) = pQ(4a - 2)$; and from
$\tfrac12 \le 2a - \tfrac12 = r + s - \tfrac12 \le 1$, follows $Q(2a - \tfrac12) = p + qQ(4a - 2)$. Therefore
$qQ(2a - 1) = pQ(2a - \tfrac12) - p^2$, and it follows that

$$\Delta(r, s) = p[\,q - p + Q(2a - \tfrac12) - qQ(2s - 1) - qQ(2r)].$$

If $2s - 1 \le 2r$, the right side here is

$$p[(q - p)(1 - Q(2r)) + \Delta(2s - 1, 2r)] \ge 0.$$

If $2r \le 2s - 1$, the right side is

$$p[(q - p)(1 - Q(2s - 1)) + \Delta(2r, 2s - 1)] \ge 0.$$


This completes the proof of (7.28) and hence of Theorem 7.3. ■

The equation (7.31) has an interesting interpretation. Let $Z_1, Z_2, \ldots$ be
independent random variables satisfying $P[Z_n = 0] = p_0 = p$ and $P[Z_n = 1] =
p_1 = q$. From $P[Z_n = 1 \text{ i.o.}] = 1$ and $\sum_{i > n} 2^{-i} = 2^{-n}$ it follows that
$P[\sum_{i=1}^{\infty} Z_i 2^{-i} \le k2^{-n}] \le P[\sum_{i=1}^{n} Z_i 2^{-i} < k2^{-n}] \le P[\sum_{i=1}^{\infty} Z_i 2^{-i} \le k2^{-n}]$.
Since by (7.31) the middle term is $Q(k2^{-n})$,

(7.33)  $Q(x) = P\left[\sum_{i=1}^{\infty} Z_i 2^{-i} \le x\right]$

holds for dyadic rational $x$ and hence by continuity holds for all $x$. In Section
31, $Q$ will reappear as a continuous, strictly increasing function singular in
the sense of Lebesgue. On p. 408 is a graph for the case $p_0 = .25$.

Note that $Q(x) = x$ in the fair case $p = \tfrac12$. In fact, for a bounded policy
Theorem 7.2 implies that $E[F_\tau] = F_0$ in the fair case, and if the policy is to
stop as soon as the fortune reaches 0 or 1, then the chance of successfully
reaching 1 is $P[F_\tau = 1] = E[F_\tau] = F_0$. Thus in the fair case with initial fortune
$x$, the chance of success is $x$ for every policy that stops at the boundaries,
and $x$ is an upper bound even if stopping earlier is allowed.


Example 7.8. The gambler of Example 7.1 has capital $900 and goal
$1000. For a fair game ($p = \tfrac12$) his chance of success is .9 whether he bets
unit stakes or adopts bold play. At red-and-black ($p = \tfrac{18}{38}$), his chance of
success with unit stakes is .00003; an approximate calculation based on (7.31)
shows that under bold play his chance $Q(.9)$ of success increases to about .88,
which compares well with the fair case. ■
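
The figure of about .88 can also be checked by simulation. The sketch below is an illustration under stated assumptions rather than the approximate calculation referred to in the example: it plays bold play in whole dollars with $p = 18/38$, and the function names, the seed, and the number of trials are arbitrary choices.

```python
import random


def bold_play_once(fortune, goal, p, rng):
    """Play bold play until the fortune reaches 0 or `goal`.

    At each trial the stake is the whole fortune, unless a win would
    overshoot the goal, in which case it is goal - fortune.
    Returns True if the goal is reached.
    """
    while 0 < fortune < goal:
        stake = min(fortune, goal - fortune)
        if rng.random() < p:
            fortune += stake
        else:
            fortune -= stake
    return fortune == goal


def estimate_success(fortune, goal, p, trials=20000, seed=0):
    """Monte Carlo estimate of the success probability under bold play."""
    rng = random.Random(seed)
    wins = sum(bold_play_once(fortune, goal, p, rng) for _ in range(trials))
    return wins / trials


estimate = estimate_success(900, 1000, 18 / 38)
print(f"estimated Q(.9) at red-and-black: {estimate:.3f}")
```

With 20,000 trials the standard error of the estimate is roughly .002, so the simulated frequency should agree with the quoted value to about two decimal places.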


Example 7.9. In Example 7.2 the capital is $100 and the goal $20,000. At
unit stakes the chance of success is .005 for $p = \tfrac12$ and $3 \times 10^{-911}$ for
$p = \tfrac{18}{38}$. Another approximate calculation shows that bold play at red-and-black
gives the gambler probability about .003 of success, which again compares
well with the fair case.

This example illustrates the point of Theorem 7.3. The gambler enters the
casino knowing that he must by dawn convert his $100 into $20,000 or face
certain death at the hands of criminals to whom he owes that amount. Only
red-and-black is available to him. The question is not whether to gamble; he
must gamble. The question is how to gamble so as to maximize the chance of
survival, and bold play is the answer. ■


There are policies other than the bold one that achieve the maximum
success probability $Q(x)$. Suppose that as long as the gambler's fortune $x$ is
less than $\tfrac12$ he bets $x$ for $x \le \tfrac14$ and $\tfrac12 - x$ for $\tfrac14 \le x \le \tfrac12$. This is, in effect, the
bold-play strategy scaled down to the interval $[0, \tfrac12]$, and so the chance he
ever reaches $\tfrac12$ is $Q(2x)$ for an initial fortune of $x$. Suppose further that if he
does reach the goal of $\tfrac12$, or if he starts with fortune at least $\tfrac12$ in the first
place, then he continues, but with ordinary bold play. For an initial fortune
$x \ge \tfrac12$, the overall chance of success is of course $Q(x)$, and for an initial
fortune $x < \tfrac12$, it is $Q(2x)Q(\tfrac12) = pQ(2x) = Q(x)$. The success probability is
indeed $Q(x)$ as for bold play, although the policy is different. With this
example in mind, one can generate a whole series of distinct optimal policies.


Timid Play * 


The optimality of bold play seems reasonable when one considers the effect
of its opposite, timid play. Let the $\epsilon$-timid policy be to bet $W_n =
\min\{\epsilon, F_{n-1}, 1 - F_{n-1}\}$ and stop when $F_n$ reaches 0 or 1. Suppose that $p < q$,
fix an initial fortune $x = F_0$ with $0 < x < 1$, and consider what happens as
$\epsilon \to 0$. By the strong law of large numbers, $\lim_n n^{-1}S_n = E[X_1] = p - q < 0$.
There is therefore probability 1 that $\sup_k S_k < \infty$ and $\lim_n S_n = -\infty$. Given
$\eta > 0$, choose $\epsilon$ so that $P[\sup_k (x + \epsilon S_k) < 1] > 1 - \eta$. Since
$P(\bigcup_{n=1}^{\infty} [x + \epsilon S_n < 0]) = 1$, with probability at least $1 - \eta$ there exists an $n$ such that
$x + \epsilon S_n < 0$ and $\max_{k < n} (x + \epsilon S_k) < 1$. But under the $\epsilon$-timid policy the gambler is in
this circumstance ruined. If $Q_\epsilon(x)$ is the probability of success under the
$\epsilon$-timid policy, then $\lim_{\epsilon \to 0} Q_\epsilon(x) = 0$ for $0 < x < 1$. The law of large numbers
carries the timid player to his ruin.†
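
The collapse of the timid player's chances can be seen numerically in a special case. The sketch below assumes that $\epsilon = 1/N$ and that the initial fortune $x = k/N$ is a multiple of $\epsilon$, so that the $\epsilon$-timid policy amounts to unit-stakes play from $k$ toward $N$ and (7.7), taken in the form $(\rho^k - 1)/(\rho^N - 1)$ with $\rho = q/p$, applies; the function name `timid_success` and the sample values are illustrative.

```python
from fractions import Fraction


def timid_success(x, n_steps, p):
    """Success probability of the (1/n_steps)-timid policy from fortune x.

    Assumes x * n_steps is an integer k, so the fortune stays on the grid
    {0, 1/n_steps, ..., 1} and the policy is unit-stakes play from k to
    n_steps.  Uses (7.7) with rho = q/p, assumed different from 1.
    """
    k = Fraction(x) * n_steps
    if k.denominator != 1:
        raise ValueError("x must be a multiple of 1/n_steps")
    p = Fraction(p)
    rho = (1 - p) / p
    return (rho ** int(k) - 1) / (rho**n_steps - 1)


x, p = Fraction(1, 2), Fraction(9, 19)
values = [float(timid_success(x, n, p)) for n in (2, 10, 100, 1000)]
print(values)  # decreases toward 0 as the stake 1/n shrinks
```

For $N = 2$ and $x = \tfrac12$ the timid and bold policies coincide, and the function returns $p$, in agreement with $Q(\tfrac12) = p$.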


PROBLEMS 


7.1. A gambler with initial capital $a$ plays until his fortune increases $b$ units or he is
ruined. Suppose that $\rho > 1$. The chance of success is multiplied by $1 + \theta$
if his initial capital is infinite instead of $a$. Show that $0 < \theta < (\rho^a - 1)^{-1} <
(a(\rho - 1))^{-1}$; relate to Example 7.3.


7.2. As shown on p. 94, there is probability 1 that the gambler either achieves his
goal of $c$ or is ruined. For $p \ne q$, deduce this directly from the strong law of
large numbers. Deduce it (for all $p$) via the Borel-Cantelli lemma from the fact
that if play never terminates, there can never occur $c$ successive $+1$'s.


*This topic may be omitted.
†For each $\epsilon$, however, there exist optimal policies under which the bet never exceeds $\epsilon$; see
DUBINS & SAVAGE.

7.3. 6.12↑ If $V_n$ is the set of $n$-long sequences of $\pm 1$'s, the function $b_n$ in (7.9)
maps $V_{n-1}$ into $\{0, 1\}$. A selection system is a sequence of such maps. Although
there are uncountably many selection systems, how many have an effective
description in the sense of an algorithm or finite set of instructions by means of
which a deputy (perhaps a machine) could operate the system for the gambler?
An analysis of the question is a matter for mathematical logic, but one can see
that there can be only countably many algorithms or finite sets of rules
expressed in finite alphabets.

Let $Y_1^{(\sigma)}, Y_2^{(\sigma)}, \ldots$ be the random variables of Theorem 7.1 for a particular
system $\sigma$, and let $C_\sigma$ be the $\omega$-set where every $k$-tuple of $\pm 1$'s ($k$ arbitrary)
occurs in $Y_1^{(\sigma)}(\omega), Y_2^{(\sigma)}(\omega), \ldots$ with the right asymptotic relative frequency (in
the sense of Problem 6.12). Let $C$ be the intersection of $C_\sigma$ over all effective
selection systems $\sigma$. Show that $C$ lies in $\mathcal{F}$ (the $\sigma$-field in the probability space
$(\Omega, \mathcal{F}, P)$ on which the $X_n$ are defined) and that $P(C) = 1$. A sequence
$(X_1(\omega), X_2(\omega), \ldots)$ for $\omega$ in $C$ is called a collective: a subsequence chosen by any
of the effective rules $\sigma$ contains all $k$-tuples in the correct proportions.


7.4. Let $D_n$ be 1 or 0 according as $X_{2n-1} \ne X_{2n}$ or not, and let $M_k$ be the time of
the $k$th 1, that is, the smallest $n$ such that $\sum_{i=1}^{n} D_i = k$. Let $Z_k = X_{2M_k}$. In other
words, look at successive nonoverlapping pairs $(X_{2n-1}, X_{2n})$, discard accordant
($X_{2n-1} = X_{2n}$) pairs, and keep the second element of discordant ($X_{2n-1} \ne X_{2n}$)
pairs. Show that this process simulates a fair coin: $Z_1, Z_2, \ldots$ are independent
and identically distributed and $P[Z_k = +1] = P[Z_k = -1] = \tfrac12$, whatever $p$ may
be. Follow the proof of Theorem 7.1.
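
The pairing procedure of this problem is easy to state as code. The sketch below only illustrates the mechanism and is no substitute for the requested proof; the function name `fair_bits` is hypothetical.

```python
def fair_bits(xs):
    """Extract Z_1, Z_2, ... from a sequence X_1, X_2, ... of +1's and -1's.

    Successive nonoverlapping pairs (X_{2n-1}, X_{2n}) are examined;
    accordant pairs are discarded and the second element of each
    discordant pair is kept.
    """
    zs = []
    for first, second in zip(xs[0::2], xs[1::2]):
        if first != second:
            zs.append(second)
    return zs


# Pairs: (+1,+1) dropped, (+1,-1) kept, (-1,+1) kept, (-1,-1) dropped.
print(fair_bits([+1, +1, +1, -1, -1, +1, -1, -1]))  # [-1, 1]
```

A trailing unpaired element is ignored, since only complete pairs are examined.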


7.5. Suppose that a gambler with initial fortune 1 stakes a proportion $\theta$ ($0 < \theta < 1$)
of his current fortune: $F_0 = 1$ and $W_n = \theta F_{n-1}$. Show that $F_n = \prod_{k=1}^{n} (1 + \theta X_k)$
and hence that

$$\log F_n = \frac{n}{2}\left[\frac{S_n}{n} \log\frac{1 + \theta}{1 - \theta} + \log(1 - \theta^2)\right].$$

Show that $F_n \to 0$ with probability 1 in the subfair case.


7.6. In “doubling,” $W_1 = 1$, $W_n = 2W_{n-1}$, and the rule is to stop after the first win.
For any positive $p$, play is certain to terminate. Here $F_\tau = F_0 + 1$, but of course
infinite capital is required. If $F_0 = 2^k - 1$ and $W_n$ cannot exceed $F_{n-1}$, the
probability of $F_\tau = F_0 + 1$ in the fair case is $1 - 2^{-k}$. Prove this via Theorem 7.2
and also directly.


7.7. In “progress and pinch,” the wager, initially some integer, is increased by 1
after a loss and decreased by 1 after a win, the stopping rule being to quit if the
next bet is 0. Show that play is certain to terminate if and only if $p \ge \tfrac12$. Show
that F, =F) + W° + 5(7 — 1). Infinite capital is required. 


7.8. Here is a common martingale. Just before the $n$th spin of the wheel, the
gambler has before him a pattern $x_1, \ldots, x_k$ of positive numbers ($k$ varies with
$n$). He bets $x_1 + x_k$, or $x_1$ in case $k = 1$. If he loses, at the next stage he uses the
pattern $x_1, \ldots, x_k, x_1 + x_k$ ($x_1, x_1$ in case $k = 1$). If he wins, at the next stage he
uses the pattern $x_2, \ldots, x_{k-1}$, unless $k$ is 1 or 2, in which case he quits. Show
that play is certain to terminate if $p > \tfrac13$ and that the ultimate gain is the sum of
the numbers in the initial pattern. Infinite capital is again required.


7.9. Suppose that $W_k = 1$, so that $F_k = F_0 + S_k$. Suppose that $p \ge q$ and $\tau$ is a
stopping time such that $1 \le \tau \le n$ with probability 1. Show that $E[F_\tau] \le E[F_n]$,
with equality in case $p = q$. Interpret this result in terms of a stock option that
must be exercised by time $n$, where $F_0 + S_k$ represents the price of the stock at
time $k$.


7.10. For a given policy, let $A_n^*$ be the fortune of the gambler's adversary at time $n$.
Consider these conditions on the policy: (i) $W_n^* \le F_{n-1}^*$; (ii) $W_n^* \le A_{n-1}^*$; (iii)
$F_n^* + A_n^*$ is constant. Interpret each condition, and show that together they
imply that the policy is bounded in the sense of (7.24).


7.11. Show that $F_\tau$ has infinite range if $F_0 = 1$, $W_n = 2^{-n}$, and $\tau$ is the smallest $n$ for
which $X_n = +1$.


7.12. Let $u$ be a real function on $[0, 1]$, $u(x)$ representing the utility of the fortune $x$.
Consider policies bounded by 1; see (7.24). Let $Q_\pi(F_0) = E[u(F_\tau)]$; this represents
the expected utility under the policy $\pi$ of an initial fortune $F_0$. Suppose of
a policy $\pi_0$ that

(7.34)  $u(x) \le Q_{\pi_0}(x), \qquad 0 \le x \le 1,$

and that

(7.35)  $Q_{\pi_0}(x) \ge pQ_{\pi_0}(x + t) + qQ_{\pi_0}(x - t), \qquad 0 \le x - t \le x \le x + t \le 1.$

Show that $Q_\pi(x) \le Q_{\pi_0}(x)$ for all $x$ and all policies $\pi$. Such a $\pi_0$ is optimal.

Theorem 7.3 is the special case of this result for $p \le \tfrac12$, bold play in the role
of $\pi_0$, and $u(x) = 1$ or $u(x) = 0$ according as $x = 1$ or $x < 1$.

The condition (7.34) says that gambling with policy $\pi_0$ is at least as good as
not gambling at all; (7.35) says that, although the prospects even under $\pi_0$
become on the average less sanguine as time passes, it is better to use $\pi_0$ now
than to use some other policy for one step and then change to $\pi_0$.


7.13. The functional equation (7.30) and the assumption that $Q$ is bounded suffice to
determine $Q$ completely. First, $Q(0)$ and $Q(1)$ must be 0 and 1, respectively, and
so (7.31) holds. Let $T_0 x = \tfrac12 x$ and $T_1 x = \tfrac12 x + \tfrac12$; let $f_0 x = px$ and $f_1 x = p + qx$.
Then $Q(T_{u_1} \cdots T_{u_n} x) = f_{u_1} \cdots f_{u_n} Q(x)$. If the binary expansions of $x$ and $y$
both begin with the digits $u_1, \ldots, u_n$, they have the form $x = T_{u_1} \cdots T_{u_n} x'$ and
$y = T_{u_1} \cdots T_{u_n} y'$. If $K$ bounds $Q$ and if $m = \max\{p, q\}$, it follows that
$|Q(x) - Q(y)| \le Km^n$. Therefore, $Q$ is continuous and satisfies (7.31) and (7.33).


SECTION 8. MARKOV CHAINS


As Markov chains illustrate in a clear and striking way the connection
between probability and measure, their basic properties are developed here
in a measure-theoretic setting.


Definitions 


Let $S$ be a finite or countable set. Suppose that to each pair $i$ and $j$ in $S$
there is assigned a nonnegative number $p_{ij}$ and that these numbers satisfy
the constraint

(8.1)  $\sum_{j \in S} p_{ij} = 1, \qquad i \in S.$

Let $X_0, X_1, X_2, \ldots$ be a sequence of random variables whose ranges are
contained in $S$. The sequence is a Markov chain or Markov process if

(8.2)  $P[X_{n+1} = j \mid X_0 = i_0, \ldots, X_n = i_n] = P[X_{n+1} = j \mid X_n = i_n] = p_{i_n j}$

for every $n$ and every sequence $i_0, \ldots, i_n$ in $S$ for which $P[X_0 = i_0, \ldots, X_n =
i_n] > 0$. The set $S$ is the state space or phase space of the process, and the $p_{ij}$
are the transition probabilities. Part of the defining condition (8.2) is that the
transition probability

(8.3)  $P[X_{n+1} = j \mid X_n = i] = p_{ij}$

does not vary with $n$.†

The elements of $S$ are thought of as the possible states of a system, $X_n$
representing the state at time $n$. The sequence or process $X_0, X_1, X_2, \ldots$ then
represents the history of the system, which evolves in accordance with the
probability law (8.2). The conditional distribution of the next state $X_{n+1}$
given the present state $X_n$ must not further depend on the past $X_0, \ldots, X_{n-1}$.
This is what (8.2) requires, and it leads to a copious theory.

The initial probabilities are

(8.4)  $\alpha_i = P[X_0 = i].$

The $\alpha_i$ are nonnegative and add to 1, but the definition of Markov chain
places no further restrictions on them.

†Sometimes in the definition of the Markov chain $P[X_{n+1} = j \mid X_n = i]$ is allowed to depend on $n$.
A chain satisfying (8.3) is then said to have stationary transition probabilities, a phrase that will be
omitted here because (8.3) will always be assumed.

The following examples illustrate some of the possibilities. In each one,
the state space $S$ and the transition probabilities $p_{ij}$ are described, but the
underlying probability space $(\Omega, \mathcal{F}, P)$ and the $X_n$ are left unspecified for
now: see Theorem 8.1.‡


Example 8.1. The Bernoulli-Laplace model of diffusion. Imagine $r$ black
balls and $r$ white balls distributed between two boxes, with the constraint
that each box contains $r$ balls. The state of the system is specified by the
number of white balls in the first box, so that the state space is $S = \{0, 1, \ldots, r\}$.
The transition mechanism is this: at each stage one ball is chosen at random
from each box and the two are interchanged. If the present state is $i$, the
chance of a transition to $i - 1$ is the chance $i/r$ of drawing one of the $i$ white
balls from the first box times the chance $i/r$ of drawing one of the $i$ black
balls from the second box. Together with similar arguments for the other
possibilities, this shows that the transition probabilities are

$$p_{i,i-1} = \left(\frac{i}{r}\right)^2, \qquad p_{i,i+1} = \left(\frac{r - i}{r}\right)^2, \qquad p_{ii} = 2\,\frac{i(r - i)}{r^2},$$

the others being 0. This is the probabilistic analogue of the model for the flow
of two liquids between two containers. ■
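
The transition probabilities of the Bernoulli-Laplace model can be tabulated directly and checked against the constraint (8.1). The following is a minimal sketch, assuming the three transition probabilities $(i/r)^2$, $((r - i)/r)^2$, and $2i(r - i)/r^2$ obtained from the drawing argument of the example; the function name `bernoulli_laplace_matrix` and the choice $r = 4$ are illustrative.

```python
from fractions import Fraction


def bernoulli_laplace_matrix(r):
    """Transition matrix of Example 8.1 on the states 0, 1, ..., r.

    State i is the number of white balls in the first box.
    """
    size = r + 1
    matrix = [[Fraction(0)] * size for _ in range(size)]
    for i in range(size):
        if i > 0:
            matrix[i][i - 1] = Fraction(i, r) ** 2
        if i < r:
            matrix[i][i + 1] = Fraction(r - i, r) ** 2
        matrix[i][i] = Fraction(2 * i * (r - i), r * r)
    return matrix


P = bernoulli_laplace_matrix(4)
for row in P:
    print([str(entry) for entry in row])
```

Each row sums to 1 because $i^2 + (r - i)^2 + 2i(r - i) = r^2$.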


The $p_{ij}$ form the transition matrix $P = [p_{ij}]$ of the process. A stochastic
matrix is one whose entries are nonnegative and satisfy (8.1); the transition
matrix of course has this property.


Example 8.2. Random walk with absorbing barriers. Suppose that $S =
\{0, 1, \ldots, r\}$ and

$$P = \begin{bmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & q & 0 & p \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{bmatrix}.$$

That is, $p_{i,i+1} = p$ and $p_{i,i-1} = q = 1 - p$ for $0 < i < r$ and $p_{00} = p_{rr} = 1$. The
chain represents a particle in random walk. The particle moves one unit to
the right or left, the respective probabilities being $p$ and $q$, except that each
of 0 and $r$ is an absorbing state: once the particle enters, it cannot leave.
The state can also be viewed as a gambler's fortune; absorption in 0
represents ruin for the gambler, absorption in $r$ ruin for his adversary (see
Section 7). The gambler's initial fortune is usually regarded as nonrandom, so
that (see (8.4)) $\alpha_i = 1$ for some $i$. ■

‡For an excellent collection of examples from physics and biology, see FELLER, Volume 1,
Chapter XV.


Example 8.3. Unrestricted random walk. Let $S$ consist of all the integers
$i = 0, \pm 1, \pm 2, \ldots$, and take $p_{i,i+1} = p$ and $p_{i,i-1} = q = 1 - p$. This chain represents
a random walk without barriers, the particle being free to move
anywhere on the integer lattice. The walk is symmetric if $p = q$. ■


The state space may, as in the preceding example, be countably infinite. If
so, the Markov chain consists of functions $X_n$ on a probability space
$(\Omega, \mathcal{F}, P)$, but these will have infinite range and hence will not be random
variables in the sense of the preceding sections. This will cause no difficulty,
however, because expected values of the $X_n$ will not be considered. All that
is required is that for each $i \in S$ the set $[\omega\colon X_n(\omega) = i]$ lie in $\mathcal{F}$ and hence
have a probability.


Example 8.4. Symmetric random walk in space. Let $S$ consist of the
integer lattice points in $k$-dimensional Euclidean space $R^k$; $x = (x_1, \ldots, x_k)$
lies in $S$ if the coordinates are all integers. Now $x$ has $2k$ neighbors, points
of the form $y = (x_1, \ldots, x_i \pm 1, \ldots, x_k)$; for each such $y$ let $p_{xy} = (2k)^{-1}$.
The chain represents a particle moving randomly in space; for $k = 1$ it
reduces to Example 8.3 with $p = q = \tfrac12$. The cases $k \le 2$ and $k \ge 3$ exhibit an
interesting difference. If $k \le 2$, the particle is certain to return to its initial
position, but this is not so if $k \ge 3$; see Example 8.6. ■


Since the state space in this example is not a subset of the line, the
$X_0, X_1, \ldots$ do not assume real values. This is immaterial because expected
values of the $X_n$ play no role. All that is necessary is that $X_n$ be a mapping
from $\Omega$ into $S$ (finite or countable) such that $[\omega\colon X_n(\omega) = i] \in \mathcal{F}$ for $i \in S$.
There will be expected values $E[f(X_n)]$ for real functions $f$ on $S$ with finite
range, but then $f(X_n(\omega))$ is a simple random variable as defined before.


Example 8.5. A selection problem. A princess must choose from among r suitors.
She is definite in her preferences and if presented with all r at once could choose her 
favorite and could even rank the whole group. They are ushered into her presence 
one by one in random order, however, and she must at each stage either stop and 
accept the suitor or else reject him and proceed in the hope that a better one will 
come along. What strategy will maximize her chance of stopping with the best suitor 
of all? 

Shorn of some details, the analysis is this. Let $S_1, S_2, \ldots, S_r$ be the suitors in order
of presentation; this sequence is a random permutation of the set of suitors. Let
$X_1 = 1$ and let $X_2, X_3, \ldots$ be the successive positions of suitors who dominate (are
preferable to) all their predecessors. Thus $X_2 = 4$ and $X_3 = 6$ means that $S_1$ dominates
$S_2$ and $S_3$ but $S_4$ dominates $S_1, S_2, S_3$, and that $S_4$ dominates $S_5$ but $S_6$
dominates $S_1, \ldots, S_5$. There can be at most $r$ of these dominant suitors; if there are
exactly $m$, $X_{m+1} = X_{m+2} = \cdots = r + 1$ by convention.

As the suitors arrive in random order, the chance that $S_i$ ranks highest among
$S_1, \ldots, S_i$ is $(i - 1)!/i! = 1/i$. The chance that $S_j$ ranks highest among $S_1, \ldots, S_j$ and
$S_i$ ranks next is $(j - 2)!/j! = 1/j(j - 1)$. This leads to a chain with transition
probabilities*

(8.5)  $P[X_{n+1} = j \mid X_n = i] = \frac{i}{j(j - 1)}, \qquad 1 \le i < j \le r.$

If $X_n = i$, then $X_{n+1} = r + 1$ means that $S_i$ dominates $S_{i+1}, \ldots, S_r$ as well as $S_1, \ldots, S_{i-1}$,
and the conditional probability of this is

(8.6)  $P[X_{n+1} = r + 1 \mid X_n = i] = \frac{i}{r}, \qquad 1 \le i \le r.$

As downward transitions are impossible and $r + 1$ is absorbing, this specifies a
transition matrix for $S = \{1, 2, \ldots, r + 1\}$.

It is quite clear that in maximizing her chance of selecting the best suitor of all, the
princess should reject those who do not dominate their predecessors. Her strategy
therefore will be to stop with the suitor in position $X_\tau$, where $\tau$ is a random variable
representing her strategy. Since her decision to stop must depend only on the suitors
she has seen thus far, the event $[\tau = n]$ must lie in $\sigma(X_1, \ldots, X_n)$. If $X_\tau = i$, then by
(8.6) the conditional probability of success is $f(i) = i/r$. The probability of success is
therefore $E[f(X_\tau)]$, and the problem is to choose the strategy $\tau$ so as to maximize it.
For the solution, see Example 8.17.† ■


Higher-Order Transitions 


The properties of the Markov chain are entirely determined by the transition
and initial probabilities. The chain rule (4.2) for conditional probabilities
gives

$$P[X_0 = i_0, X_1 = i_1, X_2 = i_2] = P[X_0 = i_0]\, P[X_1 = i_1 \mid X_0 = i_0]\, P[X_2 = i_2 \mid X_0 = i_0, X_1 = i_1] = \alpha_{i_0} p_{i_0 i_1} p_{i_1 i_2}.$$

Similarly,

(8.7)  $P[X_t = i_t,\ 0 \le t \le m] = \alpha_{i_0} p_{i_0 i_1} \cdots p_{i_{m-1} i_m}$

for any sequence $i_0, i_1, \ldots, i_m$ of states. Further,

(8.8)  $P[X_{m+t} = j_t,\ 1 \le t \le n \mid X_s = i_s,\ 0 \le s \le m] = p_{i_m j_1} p_{j_1 j_2} \cdots p_{j_{n-1} j_n},$

as follows by expressing the conditional probability as a ratio and applying
(8.7) to numerator and denominator. Adding out the intermediate states now
gives the formula

(8.9)  $p_{ij}^{(n)} = P[X_{m+n} = j \mid X_m = i] = \sum_{k_1, \ldots, k_{n-1}} p_{i k_1} p_{k_1 k_2} \cdots p_{k_{n-1} j}$

(the $k_l$ range over $S$) for the $n$th-order transition probabilities.

Notice that $p_{ij}^{(n)}$ is the entry in position $(i, j)$ of $P^n$, the $n$th power of the
transition matrix $P$. If $S$ is infinite, $P$ is a matrix with infinitely many rows
and columns; as the terms in (8.9) are nonnegative, there are no convergence
problems. It is natural to put

$$p_{ij}^{(0)} = \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}$$

Then $P^0$ is the identity $I$, as it should be. From (8.1) and (8.9) follow

(8.10)  $p_{ij}^{(m+n)} = \sum_{\nu} p_{i\nu}^{(m)} p_{\nu j}^{(n)}, \qquad \sum_{j} p_{ij}^{(n)} = 1.$

*The details can be found in DYNKIN & YUSHKEVICH, Chapter III.
†With the princess replaced by an executive and the suitors by applicants for an office job, this is
known as the secretary problem.
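
For a finite state space the higher-order transition probabilities are ordinary matrix powers, and both relations in (8.10) can be checked exactly. The sketch below uses the random walk with absorbing barriers of Example 8.2 with the hypothetical values $r = 4$ and $p = \tfrac13$; the helper names `absorbing_walk`, `mat_mul`, and `mat_pow` are illustrative and not from the text.

```python
from fractions import Fraction


def absorbing_walk(r, p):
    """Transition matrix of the random walk on {0, ..., r} with absorbing barriers."""
    p = Fraction(p)
    q = 1 - p
    P = [[Fraction(0)] * (r + 1) for _ in range(r + 1)]
    P[0][0] = P[r][r] = Fraction(1)
    for i in range(1, r):
        P[i][i + 1] = p
        P[i][i - 1] = q
    return P


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """n-th power of P; the 0-th power is the identity, matching p_ij^(0) = delta_ij."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


P = absorbing_walk(4, Fraction(1, 3))
P2, P3, P5 = mat_pow(P, 2), mat_pow(P, 3), mat_pow(P, 5)
print(P5 == mat_mul(P2, P3))             # first relation in (8.10), m = 2, n = 3
print(all(sum(row) == 1 for row in P5))  # second relation in (8.10)
```

Exact fractions are used so that the comparison of $P^5$ with $P^2 P^3$ is an equality test rather than an approximate one.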


An Existence Theorem 


Theorem 8.1. Suppose that $P = [p_{ij}]$ is a stochastic matrix and that $\alpha_i$ are
nonnegative numbers satisfying $\sum_{i \in S} \alpha_i = 1$. There exists on some $(\Omega, \mathcal{F}, P)$ a
Markov chain $X_0, X_1, X_2, \ldots$ with initial probabilities $\alpha_i$ and transition probabilities
$p_{ij}$.


PROOF. Reconsider the proof of Theorem 5.3. There the space $(\Omega, \mathcal{F}, P)$
was the unit interval, and the central part of the argument was the construction
of the decompositions (5.13). Suppose for the moment that $S = \{1, 2, \ldots\}$.
First construct a partition $I_1^{(0)}, I_2^{(0)}, \ldots$ of $(0, 1]$ into countably many† subintervals
of lengths ($P$ is again Lebesgue measure) $P(I_i^{(0)}) = \alpha_i$. Next decompose
each $I_i^{(0)}$ into subintervals $I_{ij}^{(1)}$ of lengths $P(I_{ij}^{(1)}) = \alpha_i p_{ij}$. Continuing inductively
gives a sequence of partitions $\{I_{i_0 \ldots i_n}^{(n)} \colon i_0, \ldots, i_n = 1, 2, \ldots\}$ such that
each refines the preceding and $P(I_{i_0 \ldots i_n}^{(n)}) = \alpha_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$.

Put $X_n(\omega) = i$ if $\omega \in \bigcup_{i_0 \ldots i_{n-1}} I_{i_0 \ldots i_{n-1} i}^{(n)}$. It follows just as in the proof of
Theorem 5.3 that the set $[X_0 = i_0, \ldots, X_n = i_n]$ coincides with the interval
$I_{i_0 \ldots i_n}^{(n)}$. Thus $P[X_0 = i_0, \ldots, X_n = i_n] = \alpha_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$. From this it follows
immediately that (8.4) holds and that the first and third members of
(8.2) are the same. As for the middle member, it is $P[X_n = i_n, X_{n+1} =
j]/P[X_n = i_n]$; the numerator is $\sum \alpha_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n} p_{i_n j}$, the sum extending
over all $i_0, \ldots, i_{n-1}$, and the denominator is the same thing without the factor
$p_{i_n j}$, which means that the ratio is $p_{i_n j}$, as required.

That completes the construction for the case $S = \{1, 2, \ldots\}$. For the general
countably infinite $S$, let $g$ be a one-to-one mapping of $\{1, 2, \ldots\}$ onto $S$,
and replace the $X_n$ as already constructed by $g(X_n)$; the assumption $S =
\{1, 2, \ldots\}$ was merely a notational convenience. The same argument obviously
works if $S$ is finite.‡ ■

†If $\delta_1 + \delta_2 + \cdots = b - a$ and $\delta_i \ge 0$, then $I_i = (b - \sum_{j \le i} \delta_j,\ b - \sum_{j < i} \delta_j]$, $i = 1, 2, \ldots$, decompose
$(a, b]$ into intervals of lengths $\delta_i$.
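
For a finite state space the construction in the proof can be carried out explicitly. The sketch below is an illustration under assumptions not in the text: states are labeled $0, \ldots, s - 1$, the subintervals are laid out from left to right rather than as in the footnote, exact rational arithmetic is used, and the function name `chain_path` is hypothetical.

```python
from fractions import Fraction


def chain_path(omega, alpha, P, n):
    """Return (X_0(omega), ..., X_n(omega)) for a point omega in (0, 1].

    The current interval (lo, lo + length] is split into consecutive
    left-open, right-closed subintervals whose lengths are proportional
    to alpha at step 0 and to the row P[i] of the current state i
    afterwards.  The state is the index of the subinterval containing omega.
    """
    omega = Fraction(omega)
    if not 0 < omega <= 1:
        raise ValueError("omega must lie in (0, 1]")
    lo, length = Fraction(0), Fraction(1)
    weights = alpha
    path = []
    for _ in range(n + 1):
        left = lo
        for state, w in enumerate(weights):
            right = left + length * Fraction(w)
            if left < omega <= right:
                path.append(state)
                lo, length = left, right - left
                weights = P[state]
                break
            left = right
        else:
            raise ValueError("weights must be nonnegative and sum to 1")
    return path


half = Fraction(1, 2)
alpha = [half, half]
P = [[half, half], [half, half]]
print(chain_path(Fraction(3, 8), alpha, P, 2))  # [0, 1, 0]
```

The interval reached after $n + 1$ steps has length $\alpha_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$, which is the probability the proof assigns to the path $(i_0, \ldots, i_n)$.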


Although strictly speaking the Markov chain is the sequence $X_0,
X_1, \ldots$, one often speaks as though the chain were the matrix $P$ together with
the initial probabilities $\alpha_i$ or even $P$ with some unspecified set of $\alpha_i$.
Theorem 8.1 justifies this attitude: For given $P$ and $\alpha_i$ the corresponding $X_n$
do exist, and the apparatus of probability theory (the Borel-Cantelli lemmas
and so on) is available for the study of $P$ and of systems evolving in
accordance with the Markov rule.

From now on fix a chain $X_0, X_1, \ldots$ satisfying $\alpha_i > 0$ for all $i$. Denote by $P_i$
probabilities conditional on $[X_0 = i]$: $P_i(A) = P[A \mid X_0 = i]$. Thus

(8.11)  $P_i[X_t = i_t,\ 1 \le t \le n] = p_{i i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}$

by (8.8). The interest centers on these conditional probabilities, and the
actual initial probabilities $\alpha_i$ are now largely irrelevant.

From (8.11) follows

(8.12)  $P_i[X_1 = i_1, \ldots, X_m = i_m,\ X_{m+1} = j_1, \ldots, X_{m+n} = j_n]$
        $= P_i[X_1 = i_1, \ldots, X_m = i_m]\, P_{i_m}[X_1 = j_1, \ldots, X_n = j_n].$

Suppose that $I$ is a set (finite or infinite) of $m$-long sequences of states, $J$ is a
set of $n$-long sequences of states, and every sequence in $I$ ends in $j$. Adding
both sides of (8.12) for $(i_1, \ldots, i_m)$ ranging over $I$ and $(j_1, \ldots, j_n)$ ranging over
$J$ gives

(8.13)  $P_i[(X_1, \ldots, X_m) \in I,\ (X_{m+1}, \ldots, X_{m+n}) \in J]$
        $= P_i[(X_1, \ldots, X_m) \in I]\, P_j[(X_1, \ldots, X_n) \in J].$

For this to hold it is essential that each sequence in $I$ end in $j$. The formulas
(8.12) and (8.13) are of central importance.


†For a different approach in the finite case, see Problem 8.1.




Transience and Persistence 


Let

(8.14)    $f_{ij}^{(n)} = P_i[X_1 \ne j, \ldots, X_{n-1} \ne j, X_n = j]$

be the probability of a first visit to $j$ at time $n$ for a system that starts in $i$, and let

(8.15)    $f_{ij} = P_i\bigl(\bigcup_{n=1}^{\infty} [X_n = j]\bigr) = \sum_{n=1}^{\infty} f_{ij}^{(n)}$

be the probability of an eventual visit. A state $i$ is persistent if a system starting at $i$ is certain sometime to return to $i$: $f_{ii} = 1$. The state is transient in the opposite case: $f_{ii} < 1$.

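Suppose that $n_1, \ldots, n_k$ are integers satisfying $1 \le n_1 < \cdots < n_k$ and consider the event that the system visits $j$ at times $n_1, \ldots, n_k$ but not in between; this event is determined by the conditions

$X_1 \ne j, \ldots, X_{n_1 - 1} \ne j, \quad X_{n_1} = j,$
$X_{n_1 + 1} \ne j, \ldots, X_{n_2 - 1} \ne j, \quad X_{n_2} = j,$
$\ldots$
$X_{n_{k-1} + 1} \ne j, \ldots, X_{n_k - 1} \ne j, \quad X_{n_k} = j.$

Repeated application of (8.13) shows that under $P_i$ the probability of this event is $f_{ij}^{(n_1)} f_{jj}^{(n_2 - n_1)} \cdots f_{jj}^{(n_k - n_{k-1})}$. Add this over the $k$-tuples $n_1, \ldots, n_k$: the $P_i$-probability that $X_n = j$ for at least $k$ different values of $n$ is $f_{ij} f_{jj}^{k-1}$. Letting $k \to \infty$ therefore gives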


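(8.16)    $P_i[X_n = j \text{ i.o.}] = \begin{cases} 0 & \text{if } f_{jj} < 1, \\ f_{ij} & \text{if } f_{jj} = 1. \end{cases}$

Recall that i.o. means infinitely often. Taking $i = j$ gives

(8.17)    $P_i[X_n = i \text{ i.o.}] = \begin{cases} 0 & \text{if } f_{ii} < 1, \\ 1 & \text{if } f_{ii} = 1. \end{cases}$

Thus $P_i[X_n = i \text{ i.o.}]$ is either 0 or 1; compare the zero-one law (Theorem 4.5), but note that the events $[X_n = i]$ here are not in general independent.†

†See Problem 8.35.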
Theorem 8.2.

(i) Transience of $i$ is equivalent to $P_i[X_n = i \text{ i.o.}] = 0$ and to $\sum_n p_{ii}^{(n)} < \infty$.
(ii) Persistence of $i$ is equivalent to $P_i[X_n = i \text{ i.o.}] = 1$ and to $\sum_n p_{ii}^{(n)} = \infty$.

Proof. By the first Borel-Cantelli lemma, $\sum_n p_{ii}^{(n)} < \infty$ implies $P_i[X_n = i \text{ i.o.}] = 0$, which by (8.17) in turn implies $f_{ii} < 1$. The entire theorem will be proved if it is shown that $f_{ii} < 1$ implies $\sum_n p_{ii}^{(n)} < \infty$.

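The proof uses a first-passage argument: By (8.13),

$p_{ij}^{(n)} = P_i[X_n = j] = \sum_{s=0}^{n-1} P_i[X_1 \ne j, \ldots, X_{n-s-1} \ne j, X_{n-s} = j, X_n = j]$
$\qquad = \sum_{s=0}^{n-1} P_i[X_1 \ne j, \ldots, X_{n-s-1} \ne j, X_{n-s} = j] \, P_j[X_s = j]$
$\qquad = \sum_{s=0}^{n-1} f_{ij}^{(n-s)} p_{jj}^{(s)}.$

Therefore,

$\sum_{t=1}^{n} p_{ii}^{(t)} = \sum_{t=1}^{n} \sum_{s=0}^{t-1} f_{ii}^{(t-s)} p_{ii}^{(s)} = \sum_{s=0}^{n-1} p_{ii}^{(s)} \sum_{t=s+1}^{n} f_{ii}^{(t-s)} \le \sum_{s=0}^{n} p_{ii}^{(s)} f_{ii}.$

Thus $(1 - f_{ii}) \sum_{t=1}^{n} p_{ii}^{(t)} \le f_{ii}$, and if $f_{ii} < 1$, this puts a bound on the partial sums $\sum_{t=1}^{n} p_{ii}^{(t)}$. ∎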
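The first-passage decomposition $p_{ij}^{(n)} = \sum_{s=0}^{n-1} f_{ij}^{(n-s)} p_{jj}^{(s)}$ used in this proof can be checked numerically on a small chain. The following is only a sketch: the 3-state matrix `P` and all function names are illustrative assumptions, not part of the text, and the $f_{ij}^{(n)}$ are computed from the standard recursion $f_{ij}^{(1)} = p_{ij}$, $f_{ij}^{(n)} = \sum_{k \ne j} p_{ik} f_{kj}^{(n-1)}$.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def n_step_matrices(p, n_max):
    """Return [P^0, P^1, ..., P^n_max], so result[n][i][j] is p_ij^(n)."""
    size = len(p)
    identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    powers = [identity]
    for _ in range(n_max):
        powers.append(mat_mul(powers[-1], p))
    return powers


def first_passage(p, j, n_max):
    """Return f with f[n][i] = f_ij^(n) for n = 1..n_max (f[0] is a placeholder)."""
    size = len(p)
    f = [[0.0] * size, [p[i][j] for i in range(size)]]
    for _ in range(2, n_max + 1):
        prev = f[-1]
        # Avoid j at the first step, then make a first visit to j in n - 1 more steps.
        f.append([sum(p[i][k] * prev[k] for k in range(size) if k != j)
                  for i in range(size)])
    return f


def max_decomposition_error(p, n_max):
    """Largest gap between p_ij^(n) and sum_{s<n} f_ij^(n-s) p_jj^(s)."""
    size = len(p)
    powers = n_step_matrices(p, n_max)
    worst = 0.0
    for j in range(size):
        f = first_passage(p, j, n_max)
        for i in range(size):
            for n in range(1, n_max + 1):
                rhs = sum(f[n - s][i] * powers[s][j][j] for s in range(n))
                worst = max(worst, abs(powers[n][i][j] - rhs))
    return worst


# Hypothetical 3-state transition matrix, chosen only for illustration.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]
print(max_decomposition_error(P, 10))  # should be at rounding-error level
```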


Example 8.6. Pólya's theorem. For the symmetric $k$-dimensional random walk (Example 8.4), all states are persistent if $k = 1$ or $k = 2$, and all states are transient if $k \ge 3$. To prove this, note first that the probability $p_{ii}^{(n)}$ of return in $n$ steps is the same for all states $i$; denote this probability by $a_n^{(k)}$ to indicate the dependence on the dimension $k$. Clearly, $a_{2n+1}^{(k)} = 0$. Suppose that $k = 1$. Since return in $2n$ steps means $n$ steps east and $n$ steps west,

$a_{2n}^{(1)} = \binom{2n}{n} \frac{1}{2^{2n}}.$

By Stirling's formula, $a_{2n}^{(1)} \sim (\pi n)^{-1/2}$. Therefore, $\sum_n a_n^{(1)} = \infty$, and all states are persistent by Theorem 8.2.
In the plane, a return to the starting point in $2n$ steps means equal numbers of steps east and west as well as equal numbers north and south:

$a_{2n}^{(2)} = \sum_{u=0}^{n} \frac{(2n)!}{u! \, u! \, (n-u)! \, (n-u)!} \frac{1}{4^{2n}} = \binom{2n}{n} \frac{1}{4^{2n}} \sum_{u=0}^{n} \binom{n}{u} \binom{n}{n-u}.$

It can be seen on combinatorial grounds that the last sum is $\binom{2n}{n}$, and so $a_{2n}^{(2)} = (a_{2n}^{(1)})^2 \sim (\pi n)^{-1}$. Again, $\sum_n a_n^{(2)} = \infty$ and every state is persistent.
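For three dimensions,

$a_{2n}^{(3)} = \sum \frac{(2n)!}{u! \, u! \, v! \, v! \, (n-u-v)! \, (n-u-v)!} \frac{1}{6^{2n}},$

the sum extending over nonnegative $u$ and $v$ satisfying $u + v \le n$. This reduces to

(8.18)    $a_{2n}^{(3)} = \sum_{l=0}^{n} \binom{2n}{2l} \left(\frac{1}{3}\right)^{2n-2l} \left(\frac{2}{3}\right)^{2l} a_{2n-2l}^{(1)} a_{2l}^{(2)},$

as can be checked by substitution. (To see the probabilistic meaning of this formula, condition on there being $2n - 2l$ steps parallel to the vertical axis and $2l$ steps parallel to the horizontal plane.) It will be shown that $a_{2n}^{(3)} = O(n^{-3/2})$, which will imply that $\sum_n a_n^{(3)} < \infty$. The terms in (8.18) for $l = 0$ and $l = n$ are each $O(n^{-3/2})$ and hence can be omitted. Now $a_u^{(1)} \le K u^{-1/2}$ and $a_u^{(2)} \le K u^{-1}$, as already seen, and so the sum in question is at most

$K^2 \sum_{l=1}^{n-1} \binom{2n}{2l} \left(\frac{1}{3}\right)^{2n-2l} \left(\frac{2}{3}\right)^{2l} (2n-2l)^{-1/2} (2l)^{-1}.$

Since $(2n-2l)^{-1/2} \le 2n^{1/2} (2n-2l)^{-1} \le 4n^{1/2} (2n-2l+1)^{-1}$ and $(2l)^{-1} \le 2(2l+1)^{-1}$, this is at most a constant times

$n^{1/2} \frac{(2n)!}{(2n+2)!} \sum_{l=1}^{n-1} \binom{2n+2}{2l+1} \left(\frac{1}{3}\right)^{2n-2l+1} \left(\frac{2}{3}\right)^{2l+1} = O(n^{-3/2}).$

Thus $\sum_n a_n^{(3)} < \infty$, and the states are transient. The same is true for $k = 4, 5, \ldots$, since an inductive extension of the argument shows that $a_n^{(k)} = O(n^{-k/2})$. ∎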
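The identities used in this example are easy to test for small $n$. The sketch below (function names are illustrative assumptions) computes $a_{2n}^{(1)}$, $a_{2n}^{(2)}$, and $a_{2n}^{(3)}$ from the multinomial sums, compares $a_{2n}^{(2)}$ with $(a_{2n}^{(1)})^2$, and compares the direct three-dimensional sum with the conditioned form (8.18).

```python
from math import comb, factorial, pi, sqrt


def a1(n):
    """Return probability a_{2n}^{(1)} for the symmetric walk on the line."""
    return comb(2 * n, n) / 4 ** n


def a2(n):
    """Return probability a_{2n}^{(2)} in the plane, from the multinomial sum."""
    total = sum(factorial(2 * n) // (factorial(u) ** 2 * factorial(n - u) ** 2)
                for u in range(n + 1))
    return total / 16 ** n  # 4^(2n)


def a3_direct(n):
    """Return probability a_{2n}^{(3)} from the sum over u + v <= n."""
    total = 0
    for u in range(n + 1):
        for v in range(n - u + 1):
            w = n - u - v
            total += factorial(2 * n) // (
                factorial(u) ** 2 * factorial(v) ** 2 * factorial(w) ** 2)
    return total / 36 ** n  # 6^(2n)


def a3_conditioned(n):
    """Right side of (8.18): condition on 2l steps in the horizontal plane."""
    return sum(comb(2 * n, 2 * l) * (1 / 3) ** (2 * n - 2 * l) * (2 / 3) ** (2 * l)
               * a1(n - l) * a2(l)
               for l in range(n + 1))


for n in (1, 5, 10):
    print(n, a1(n), a2(n), a3_direct(n), a3_conditioned(n))

# Stirling check in dimension 1: this ratio should be close to 1 for large n.
print(a1(200) * sqrt(pi * 200))
```

The partial sums of `a1` and `a2` grow without bound while those of `a3_direct` stay bounded, which is the numerical face of the persistent/transient dichotomy; the code does not prove this, it only illustrates it.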


It is possible for a system starting in $i$ to reach $j$ ($f_{ij} > 0$) if and only if $p_{ij}^{(n)} > 0$ for some $n$. If this is true for all $i$ and $j$, the Markov chain is irreducible.
Theorem 8.3. If the Markov chain is irreducible, then one of the following two alternatives holds.

(i) All states are transient, $P_i(\bigcup_j [X_n = j \text{ i.o.}]) = 0$ for all $i$, and $\sum_n p_{ij}^{(n)} < \infty$ for all $i$ and $j$.

(ii) All states are persistent, $P_i(\bigcap_j [X_n = j \text{ i.o.}]) = 1$ for all $i$, and $\sum_n p_{ij}^{(n)} = \infty$ for all $i$ and $j$.


The irreducible chain itself can accordingly be called persistent or tran- 
sient. In the persistent case the system visits every state infinitely often. In 
the transient case it visits each state only finitely often, hence visits each 
finite set only finitely often, and so may be said to go to infinity. 


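Proof. For each $i$ and $j$ there exist $r$ and $s$ such that $p_{ij}^{(r)} > 0$ and $p_{ji}^{(s)} > 0$. Now

(8.19)    $p_{ii}^{(r+s+n)} \ge p_{ij}^{(r)} p_{jj}^{(n)} p_{ji}^{(s)},$

and from $p_{ij}^{(r)} p_{ji}^{(s)} > 0$ it follows that $\sum_n p_{ii}^{(n)} < \infty$ implies $\sum_n p_{jj}^{(n)} < \infty$: if one state is transient, they all are. In this case (8.16) gives $P_i[X_n = j \text{ i.o.}] = 0$ for all $i$ and $j$, so that $P_i(\bigcup_j [X_n = j \text{ i.o.}]) = 0$ for all $i$. Since $\sum_{n=1}^{\infty} p_{ij}^{(n)} = \sum_{n=1}^{\infty} \sum_{\nu=1}^{n} f_{ij}^{(\nu)} p_{jj}^{(n-\nu)} = \sum_{\nu=1}^{\infty} f_{ij}^{(\nu)} \sum_{m=0}^{\infty} p_{jj}^{(m)} \le \sum_{m=0}^{\infty} p_{jj}^{(m)}$, it follows that if $j$ is transient, then (Theorem 8.2) $\sum_n p_{ij}^{(n)}$ converges for every $i$.

The other possibility is that all states are persistent. In this case $P_j[X_n = j \text{ i.o.}] = 1$ by Theorem 8.2, and it follows by (8.13) that

$p_{ji}^{(m)} = P_j([X_m = i] \cap [X_n = j \text{ i.o.}])$
$\qquad \le \sum_{n > m} P_j[X_m = i, X_{m+1} \ne j, \ldots, X_{n-1} \ne j, X_n = j]$
$\qquad = \sum_{n > m} p_{ji}^{(m)} f_{ij}^{(n-m)} = p_{ji}^{(m)} f_{ij}.$

There is an $m$ for which $p_{ji}^{(m)} > 0$, and therefore $f_{ij} = 1$. By (8.16), $P_i[X_n = j \text{ i.o.}] = f_{ij} = 1$. If $\sum_n p_{ij}^{(n)}$ were to converge for some $i$ and $j$, it would follow by the first Borel-Cantelli lemma that $P_i[X_n = j \text{ i.o.}] = 0$. ∎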


Example 8.7. Since $\sum_j p_{ij}^{(n)} = 1$, the first alternative in Theorem 8.3 is impossible if $S$ is finite: a finite, irreducible Markov chain is persistent. ∎

Example 8.8. The chain in Pólya's theorem is certainly irreducible. If the dimension is 1 or 2, there is probability 1 that a particle in symmetric random walk visits every state infinitely often. If the dimension is 3 or more, the particle goes to infinity. ∎
Example 8.9. Consider the unrestricted random walk on the line (Example 8.3). According to the ruin calculation (7.8), $f_{01} = p/q$ for $p < q$. Since the chain is irreducible, all states are transient. By symmetry, of course, the chain is also transient if $p > q$, although in this case (7.8) gives $f_{01} = 1$. Thus $f_{ij} = 1$ ($i \ne j$) is possible in the transient case.†

If $p = q = \frac{1}{2}$, the chain is persistent by Pólya's theorem. If $n$ and $j - i$ have the same parity,

$p_{ij}^{(n)} = \binom{n}{\frac{n + j - i}{2}} \frac{1}{2^n}, \qquad |j - i| \le n.$

This is maximal if $j = i$ or $j = i \pm 1$, and by Stirling's formula the maximal value is of order $n^{-1/2}$. Therefore, $\lim_n p_{ij}^{(n)} = 0$, which always holds in the transient case but is thus possible in the persistent case as well (see Theorem 8.8). ∎


Another Criterion for Persistence 


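Let $Q = [q_{ij}]$ be a matrix with rows and columns indexed by the elements of a finite or countable set $U$. Suppose it is substochastic in the sense that $q_{ij} \ge 0$ and $\sum_j q_{ij} \le 1$. Let $Q^n = [q_{ij}^{(n)}]$ be the $n$th power, so that

(8.20)    $q_{ij}^{(n+1)} = \sum_{\nu} q_{i\nu} q_{\nu j}^{(n)}, \qquad q_{ij}^{(0)} = \delta_{ij}.$

Consider the row sums

(8.21)    $\sigma_i^{(n)} = \sum_j q_{ij}^{(n)}.$

From (8.20) follows

(8.22)    $\sigma_i^{(n+1)} = \sum_j q_{ij} \sigma_j^{(n)}.$

Since $Q$ is substochastic, $\sigma_i^{(1)} \le 1$, and hence $\sigma_i^{(n+1)} = \sum_{\nu} q_{i\nu}^{(n)} \sigma_{\nu}^{(1)} \le \sigma_i^{(n)}$. Therefore, the monotone limits

(8.23)    $\sigma_i = \lim_{n \to \infty} \sigma_i^{(n)}$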


†But for each $j$ there must be some $i \ne j$ for which $f_{ij} < 1$; see Problem 8.7.
exist. By (8.22) and the Weierstrass M-test [A28], $\sigma_i = \sum_j q_{ij} \sigma_j$. Thus the $\sigma_i$ solve the system

(8.24)    $x_i = \sum_{j \in U} q_{ij} x_j, \quad i \in U; \qquad 0 \le x_i \le 1, \quad i \in U.$

For an arbitrary solution, $x_i = \sum_j q_{ij} x_j \le \sum_j q_{ij} = \sigma_i^{(1)}$, and $x_j \le \sigma_j^{(n)}$ for all $j$ implies $x_i \le \sum_j q_{ij} \sigma_j^{(n)} = \sigma_i^{(n+1)}$ by (8.22). Thus $x_i \le \sigma_i^{(n)}$ for all $n$ by induction, and so $x_i \le \sigma_i$. Thus the $\sigma_i$ give the maximal solution to (8.24):


Lemma 1. For a substochastic matrix Q the limits (8.23) are the maximal 
solution of (8.24). 
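For a finite substochastic matrix, the limits (8.23) can be approximated by iterating (8.22) from $\sigma_i^{(0)} = 1$. The sketch below does this for a hypothetical $3 \times 3$ matrix `Q` (an assumption made for illustration) in which states 0 and 1 form a closed class and state 2 leaks mass; the iteration is a plain fixed-point loop, not a method taken from the text.

```python
def row_sum_limits(q, tol=1e-13, max_iter=100000):
    """Approximate sigma_i = lim_n sigma_i^(n) by iterating (8.22)."""
    size = len(q)
    sigma = [1.0] * size  # sigma^(0): row sums of Q^0, the identity
    for _ in range(max_iter):
        nxt = [sum(q[i][j] * sigma[j] for j in range(size)) for i in range(size)]
        if max(abs(a - b) for a, b in zip(sigma, nxt)) < tol:
            return nxt
        sigma = nxt
    return sigma


# Hypothetical substochastic matrix: rows 0 and 1 sum to 1, row 2 sums to 0.8.
Q = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.2, 0.3, 0.3]]
sigma = row_sum_limits(Q)
print(sigma)
```

Here $\sigma_0 = \sigma_1 = 1$, and $\sigma_2$ solves $\sigma_2 = 0.5 + 0.3\,\sigma_2$, so the iteration should approach $5/7$. Any other solution of (8.24) for this `Q`, such as $x \equiv 0$, lies below `sigma` coordinatewise.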


Now suppose that $U$ is a subset of the state space $S$. The $p_{ij}$ for $i$ and $j$ in $U$ give a substochastic matrix $Q$. The row sums (8.21) are $\sigma_i^{(n)} = \sum p_{i j_1} p_{j_1 j_2} \cdots p_{j_{n-1} j_n}$, where the $j_1, \ldots, j_n$ range over $U$, and so $\sigma_i^{(n)} = P_i[X_t \in U,\ t \le n]$. Let $n \to \infty$:

(8.25)    $\sigma_i = P_i[X_t \in U,\ t = 1, 2, \ldots], \qquad i \in U.$

In this case, $\sigma_i$ is thus the probability that the system remains forever in $U$, given that it starts at $i$. The following theorem is now an immediate consequence of Lemma 1.


Theorem 8.4. For $U \subset S$ the probabilities (8.25) are the maximal solution of the system

(8.26)    $x_i = \sum_{j \in U} p_{ij} x_j, \quad i \in U; \qquad 0 \le x_i \le 1, \quad i \in U.$

The constraint $x_i \ge 0$ in (8.26) is in a sense redundant: Since $x_i \equiv 0$ is a solution, the maximal solution is automatically nonnegative (and similarly for (8.24)). And the maximal solution is $x_i \equiv 1$ if and only if $\sum_{j \in U} p_{ij} = 1$ for all $i$ in $U$, which makes probabilistic sense.


Example 8.10. For the random walk on the line consider the set $U = \{0, 1, 2, \ldots\}$. The system (8.26) is

$x_i = p x_{i+1} + q x_{i-1}, \quad i \ge 1,$
$x_0 = p x_1.$

It follows [A19] that $x_n = A + An$ if $p = q$ and $x_n = A - A(q/p)^{n+1}$ if $p \ne q$. The only bounded solution is $x_n \equiv 0$ if $q \ge p$, and in this case there is probability 0 of staying forever among the nonnegative integers. If $q < p$, $A = 1$ gives the maximal solution $x_n = 1 - (q/p)^{n+1}$ (and $0 \le A < 1$ gives exactly the solutions that are not maximal). Compare (7.8) and Example 8.9. ∎


Now consider the system (8.26) with $U = S - \{i_0\}$ for an arbitrary single state $i_0$:

(8.27)    $x_i = \sum_{j \ne i_0} p_{ij} x_j, \quad i \ne i_0; \qquad 0 \le x_i \le 1, \quad i \ne i_0.$

There is always the trivial solution, the one for which $x_i \equiv 0$.


Theorem 8.5. An irreducible chain is transient if and only if (8.27) has a nontrivial solution.

Proof. The probabilities

(8.28)    $1 - f_{i i_0} = P_i[X_n \ne i_0,\ n \ge 1], \qquad i \ne i_0,$

are by Theorem 8.4 the maximal solution of (8.27). Therefore (8.27) has a nontrivial solution if and only if $f_{i i_0} < 1$ for some $i \ne i_0$. If the chain is persistent, this is impossible by Theorem 8.3(ii).

Suppose the chain is transient. Since

$f_{i_0 i_0} = P_{i_0}[X_1 = i_0] + \sum_{n=2}^{\infty} \sum_{i \ne i_0} P_{i_0}[X_1 = i, X_2 \ne i_0, \ldots, X_{n-1} \ne i_0, X_n = i_0]$
$\qquad = p_{i_0 i_0} + \sum_{i \ne i_0} p_{i_0 i} f_{i i_0},$

and since $f_{i_0 i_0} < 1$, it follows that $f_{i i_0} < 1$ for some $i \ne i_0$. ∎

Since the equations in (8.27) are homogeneous, the issue is whether they have a solution that is nonnegative, nontrivial, and bounded. If they do, $0 \le x_i \le 1$ can be arranged by rescaling.†

†See Problem 8.9.
Example 8.11. In the simplest of queueing models the state space is $\{0, 1, 2, \ldots\}$ and the transition matrix has the form

$P = \begin{bmatrix} t_0 & t_1 & t_2 & 0 & 0 & \cdots \\ t_0 & t_1 & t_2 & 0 & 0 & \cdots \\ 0 & t_0 & t_1 & t_2 & 0 & \cdots \\ 0 & 0 & t_0 & t_1 & t_2 & \cdots \\ \vdots & & & & & \ddots \end{bmatrix}.$

If there are $i$ customers in the queue and $i \ge 1$, the customer at the head of the queue is served and leaves, and then 0, 1, or 2 new customers arrive (probabilities $t_0, t_1, t_2$), which leaves a queue of length $i - 1$, $i$, or $i + 1$. If $i = 0$, no one is served, and the new customers bring the queue length to 0, 1, or 2. Assume that $t_0$ and $t_2$ are positive, so that the chain is irreducible.

For $i_0 = 0$ the system (8.27) is

(8.29)    $x_1 = t_1 x_1 + t_2 x_2,$
          $x_k = t_0 x_{k-1} + t_1 x_k + t_2 x_{k+1}, \quad k \ge 2.$

Since $t_0, t_1, t_2$ have the form $q(1 - t),\ t,\ p(1 - t)$ for appropriate $p, q, t$, the second line of (8.29) has the form $x_k = p x_{k+1} + q x_{k-1}$, $k \ge 2$. Now the solution [A19] is $A + B(q/p)^k = A + B(t_0/t_2)^k$ if $t_0 \ne t_2$ ($p \ne q$) and $A + Bk$ if $t_0 = t_2$ ($p = q$), and $A$ can be expressed in terms of $B$ because of the first equation in (8.29). The result is

$x_k = \begin{cases} B((t_0/t_2)^k - 1) & \text{if } t_0 \ne t_2, \\ Bk & \text{if } t_0 = t_2. \end{cases}$

There is a nontrivial solution if $t_0 < t_2$ but not if $t_0 \ge t_2$.

If $t_0 < t_2$, the chain is thus transient, and the queue size goes to infinity with probability 1. If $t_0 \ge t_2$, the chain is persistent. For a nonempty queue the expected increase in queue length in one step is $t_2 - t_0$, and the queue goes out of control if and only if this is positive. ∎
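As a quick check on the solution of (8.29), the sketch below evaluates $x_k = 1 - (t_0/t_2)^k$ (the case $B = -1$, which lies in $[0, 1]$ when $t_0 < t_2$) and returns the residuals of both lines of (8.29). The numerical values of $t_0, t_1, t_2$ and the function names are assumptions chosen for illustration.

```python
def escape_solution(t0, t2, k):
    """x_k = 1 - (t0/t2)^k: the solution B((t0/t2)^k - 1) of (8.29) with B = -1."""
    return 1.0 - (t0 / t2) ** k


def residuals(t0, t1, t2, k_max):
    """Left side minus right side for the equations of (8.29), k = 1..k_max."""
    x = [escape_solution(t0, t2, k) for k in range(k_max + 2)]
    res = [x[1] - (t1 * x[1] + t2 * x[2])]  # first line of (8.29)
    for k in range(2, k_max + 1):           # second line of (8.29)
        res.append(x[k] - (t0 * x[k - 1] + t1 * x[k] + t2 * x[k + 1]))
    return res


# Hypothetical arrival probabilities with t0 < t2 (the transient case).
t0, t1, t2 = 0.2, 0.3, 0.5
print(max(abs(r) for r in residuals(t0, t1, t2, 30)))
print([escape_solution(t0, t2, k) for k in range(1, 6)])
```

With $t_0 < t_2$ the values increase toward 1, consistent with their interpretation (Theorem 8.4 and (8.28)) as probabilities of never reaching state 0.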


Stationary Distributions 
Suppose that the chain has initial probabilities $\pi_i$ satisfying

(8.30)    $\sum_{i \in S} \pi_i p_{ij} = \pi_j, \qquad j \in S.$
It then follows by induction that

(8.31)    $\sum_{i \in S} \pi_i p_{ij}^{(n)} = \pi_j, \qquad j \in S,\ n = 0, 1, 2, \ldots.$

If $\pi_i$ is the probability that $X_0 = i$, then the left side of (8.31) is the probability that $X_n = j$, and thus (8.30) implies that the probability of $[X_n = j]$ is the same for all $n$. A set of probabilities satisfying (8.30) is for this reason called a stationary distribution. The existence of such a distribution implies that the chain is very stable.

To discuss this requires the notion of periodicity. The state $j$ has period $t$ if $p_{jj}^{(n)} > 0$ implies that $t$ divides $n$ and if $t$ is the largest integer with this property. In other words, the period of $j$ is the greatest common divisor of the set of integers

(8.32)    $[n : n \ge 1,\ p_{jj}^{(n)} > 0].$


If the chain is irreducible, then for each pair $i$ and $j$ there exist $r$ and $s$ for which $p_{ij}^{(r)}$ and $p_{ji}^{(s)}$ are positive, and of course

(8.33)    $p_{ii}^{(r+s+n)} \ge p_{ij}^{(r)} p_{jj}^{(n)} p_{ji}^{(s)}.$

Let $t_i$ and $t_j$ be the periods of $i$ and $j$. Taking $n = 0$ in this inequality shows that $t_i$ divides $r + s$; and now it follows by the inequality that $p_{jj}^{(n)} > 0$ implies that $t_i$ divides $r + s + n$ and hence divides $n$. Thus $t_i$ divides each integer in the set (8.32), and so $t_i \le t_j$. Since $i$ and $j$ can be interchanged in this argument, $i$ and $j$ have the same period. One can thus speak of the period of the chain itself in the irreducible case. The random walk on the line has period 2, for example. If the period is 1, the chain is aperiodic.


Lemma 2. In an irreducible, aperiodic chain, for each $i$ and $j$, $p_{ij}^{(n)} > 0$ for all $n$ exceeding some $n_0(i, j)$.

Proof. Since $p_{jj}^{(m+n)} \ge p_{jj}^{(m)} p_{jj}^{(n)}$, if $M$ is the set (8.32) then $m \in M$ and $n \in M$ together imply $m + n \in M$. But it is a fact of number theory [A21] that if a set of positive integers is closed under addition and has greatest common divisor 1, then it contains all integers exceeding some $n_1$. Given $i$ and $j$, choose $r$ so that $p_{ij}^{(r)} > 0$. If $n > n_0 = n_1 + r$, then $p_{ij}^{(n)} \ge p_{ij}^{(r)} p_{jj}^{(n-r)} > 0$. ∎
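For a finite irreducible chain the period of a state can be computed directly from the definition, as the greatest common divisor of the return times in (8.32). The sketch below tracks only which states are reachable in exactly $n$ steps. The cutoff of $3s$ steps for $s$ states is an assumption of this sketch (it suffices for an irreducible chain, since every simple cycle can be spliced into a closed walk through $j$ of length at most $3s - 2$); the two example matrices are illustrative.

```python
from math import gcd


def period(p, j):
    """gcd of the n in 1..3s with p_jj^(n) > 0, for a chain with s states."""
    size = len(p)
    support = [k == j for k in range(size)]  # states reachable from j in 0 steps
    d = 0
    for n in range(1, 3 * size + 1):
        # States reachable from j in exactly n steps.
        support = [any(support[i] and p[i][k] > 0 for i in range(size))
                   for k in range(size)]
        if support[j]:
            d = gcd(d, n)
    return d


# Symmetric walk on a circle of 4 points: returns are possible only at even times.
cycle = [[0.5 if (k - i) % 4 in (1, 3) else 0.0 for k in range(4)] for i in range(4)]

# Same walk, except that state 0 may also stay put: the chain becomes aperiodic.
lazy = [row[:] for row in cycle]
lazy[0] = [0.5, 0.25, 0.0, 0.25]

print(period(cycle, 0), period(lazy, 0), period(lazy, 2))
```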


Theorem 8.6. Suppose of an irreducible, aperiodic chain that there exists a stationary distribution, that is, a solution of (8.30) satisfying $\pi_i \ge 0$ and $\sum_i \pi_i = 1$. Then the chain is persistent,

(8.34)    $\lim_n p_{ij}^{(n)} = \pi_j$

for all $i$ and $j$, the $\pi_j$ are all positive, and the stationary distribution is unique.

The main point of the conclusion is that the effect of the initial state wears off. Whatever the actual initial distribution $\{\alpha_i\}$ of the chain may be, if (8.34) holds, then it follows by the M-test that the probability $\sum_i \alpha_i p_{ij}^{(n)}$ of $[X_n = j]$ converges to $\pi_j$.


Proof. If the chain is transient, then $p_{ij}^{(n)} \to 0$ for all $i$ and $j$ by Theorem 8.3, and it follows by (8.31) and the M-test that $\pi_j$ is identically 0, which contradicts $\sum_j \pi_j = 1$. The existence of a stationary distribution therefore implies that the chain is persistent.

Consider now a Markov chain with state space $S \times S$ and transition probabilities $p(ij, kl) = p_{ik} p_{jl}$ (it is easy to verify that these form a stochastic matrix). Call this the coupled chain; it describes the joint behavior of a pair of independent systems, each evolving according to the laws of the original Markov chain. By Theorem 8.1 there exists a Markov chain $(X_n, Y_n)$, $n = 0, 1, \ldots$, having positive initial probabilities and transition probabilities

$P[(X_{n+1}, Y_{n+1}) = (k, l) \mid (X_n, Y_n) = (i, j)] = p(ij, kl).$

For $n$ exceeding some $n_0$ depending on $i, j, k, l$, the probability $p^{(n)}(ij, kl) = p_{ik}^{(n)} p_{jl}^{(n)}$ is positive by Lemma 2. Therefore, the coupled chain is irreducible. (This proof that the coupled chain is irreducible requires only the assumptions that the original chain is irreducible and aperiodic, a fact needed again in the proof of Theorem 8.7.)

It is easy to check that $\pi(ij) = \pi_i \pi_j$ forms a set of stationary initial probabilities for the coupled chain, which, like the original one, must therefore be persistent. It follows that, for an arbitrary initial state $(i, j)$ for the chain $\{(X_n, Y_n)\}$ and an arbitrary $i_0$ in $S$, one has $P_{ij}[(X_n, Y_n) = (i_0, i_0) \text{ i.o.}] = 1$. If $\tau$ is the smallest integer such that $X_\tau = Y_\tau = i_0$, then $\tau$ is finite with probability 1 under $P_{ij}$. The idea of the proof is now this: $X_n$ starts in $i$ and $Y_n$ starts in $j$; once $X_n = Y_n = i_0$ occurs, $X_n$ and $Y_n$ follow identical probability laws, and hence the initial states $i$ and $j$ will lose their influence.

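By (8.13) applied to the coupled chain, if $m \le n$, then

$P_{ij}[(X_n, Y_n) = (k, l),\ \tau = m]$
$\qquad = P_{ij}[(X_t, Y_t) \ne (i_0, i_0),\ t < m,\ (X_m, Y_m) = (i_0, i_0)] \times P_{i_0 i_0}[(X_{n-m}, Y_{n-m}) = (k, l)]$
$\qquad = P_{ij}[\tau = m] \, p_{i_0 k}^{(n-m)} p_{i_0 l}^{(n-m)}.$

Adding out $l$ gives $P_{ij}[X_n = k,\ \tau = m] = P_{ij}[\tau = m] \, p_{i_0 k}^{(n-m)}$, and adding out $k$ gives $P_{ij}[Y_n = l,\ \tau = m] = P_{ij}[\tau = m] \, p_{i_0 l}^{(n-m)}$. Take $k = l$, equate probabilities, and add over $m = 1, \ldots, n$:

$P_{ij}[X_n = k,\ \tau \le n] = P_{ij}[Y_n = k,\ \tau \le n].$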
From this follows

$P_{ij}[X_n = k] \le P_{ij}[X_n = k,\ \tau \le n] + P_{ij}[\tau > n]$
$\qquad = P_{ij}[Y_n = k,\ \tau \le n] + P_{ij}[\tau > n]$
$\qquad \le P_{ij}[Y_n = k] + P_{ij}[\tau > n].$

This and the same inequality with $X$ and $Y$ interchanged give

$|p_{ik}^{(n)} - p_{jk}^{(n)}| = |P_{ij}[X_n = k] - P_{ij}[Y_n = k]| \le P_{ij}[\tau > n].$

Since $\tau$ is finite with probability 1,

(8.35)    $\lim_n |p_{ik}^{(n)} - p_{jk}^{(n)}| = 0.$

(This proof of (8.35) goes through as long as the coupled chain is irreducible and persistent; no assumptions on the original chain are needed. This fact is used in the proof of the next theorem.)

By (8.31), $\pi_k - p_{jk}^{(n)} = \sum_i \pi_i (p_{ik}^{(n)} - p_{jk}^{(n)})$, and this goes to 0 by the M-test if (8.35) holds. Thus $\lim_n p_{jk}^{(n)} = \pi_k$. As this holds for each stationary distribution, there can be only one of them.

It remains to show that the $\pi_j$ are all strictly positive. Choose $r$ and $s$ so that $p_{ij}^{(r)}$ and $p_{ji}^{(s)}$ are positive. Letting $n \to \infty$ in (8.33) shows that $\pi_i$ is positive if $\pi_j$ is; since some $\pi_j$ is positive (they add to 1), all the $\pi_i$ must be positive. ∎
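The conclusion of Theorem 8.6, that the initial state wears off, can be observed on a small example. The sketch below pushes each point-mass initial distribution forward through a hypothetical irreducible, aperiodic 3-state matrix `P` (an assumption for illustration) and compares the results with the stationary distribution obtained by solving (8.30) by hand for this particular matrix.

```python
def evolve(dist, p, n):
    """Return the distribution of X_n when X_0 has distribution dist."""
    size = len(p)
    for _ in range(n):
        dist = [sum(dist[i] * p[i][j] for i in range(size)) for j in range(size)]
    return dist


# Hypothetical transition matrix; the positive diagonal entries make it aperiodic.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]

# Solving (8.30) for this P by hand gives pi proportional to (0.4, 1, 1.25).
pi = [0.4 / 2.65, 1.0 / 2.65, 1.25 / 2.65]

rows = [evolve([1.0 if k == i else 0.0 for k in range(3)], P, 200) for i in range(3)]
for i, row in enumerate(rows):
    print(i, row)  # each row approximates (p_i0^(200), p_i1^(200), p_i2^(200))
```

All three printed rows should agree with `pi` to many decimal places, whichever state the chain started from.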


Example 8.12. For the queueing model in Example 8.11 the equations (8.30) are

$\pi_0 = \pi_0 t_0 + \pi_1 t_0,$
$\pi_1 = \pi_0 t_1 + \pi_1 t_1 + \pi_2 t_0,$
$\pi_2 = \pi_0 t_2 + \pi_1 t_2 + \pi_2 t_1 + \pi_3 t_0,$
$\pi_k = \pi_{k-1} t_2 + \pi_k t_1 + \pi_{k+1} t_0, \quad k \ge 3.$

Again write $t_0, t_1, t_2$ as $q(1 - t),\ t,\ p(1 - t)$. Since the last equation here is $\pi_k = q \pi_{k+1} + p \pi_{k-1}$, the solution is

$\pi_k = \begin{cases} A + B(p/q)^k = A + B(t_2/t_0)^k & \text{if } t_0 \ne t_2, \\ A + Bk & \text{if } t_0 = t_2 \end{cases}$

for $k \ge 2$. If $t_0 < t_2$ and $\sum_k \pi_k$ converges, then $\pi_k \equiv 0$, and hence there is no stationary distribution; but this is not new, because it was shown in Example 8.11 that the chain is transient in this case. If $t_0 = t_2$, there is again no stationary distribution, and this is new because the chain was in Example 8.11 shown to be persistent in this case.

If $t_0 > t_2$, then $\sum_k \pi_k$ converges, provided $A = 0$. Solving for $\pi_0$ and $\pi_1$ in the first two equations of the system above gives $\pi_0 = B t_2$ and $\pi_1 = B t_2 (1 - t_0)/t_0$. From $\sum_k \pi_k = 1$ it now follows that $B = (t_0 - t_2)/t_2$, and the $\pi_k$ can be written down explicitly. Since $\pi_k = B(t_2/t_0)^k$ for $k \ge 2$, there is small chance of a large queue length. ∎
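The explicit stationary distribution for $t_0 > t_2$ can be checked against the balance equations. This is a sketch under assumed numerical values of $t_0, t_1, t_2$; the function names are illustrative, and the infinite sum $\sum_k \pi_k$ is approximated by truncation.

```python
def queue_stationary(t0, t2, k):
    """pi_k for the queueing chain when t0 > t2 > 0, from the formulas of Example 8.12."""
    if not (t0 > t2 > 0):
        raise ValueError("a stationary distribution requires t0 > t2 > 0")
    b = (t0 - t2) / t2
    if k == 0:
        return b * t2
    if k == 1:
        return b * t2 * (1 - t0) / t0
    return b * (t2 / t0) ** k


def balance_residuals(t0, t1, t2, k_max):
    """Left side minus right side of the equations (8.30), for k = 0..k_max (k_max >= 2)."""
    pi = [queue_stationary(t0, t2, k) for k in range(k_max + 2)]
    res = [pi[0] - (pi[0] * t0 + pi[1] * t0),
           pi[1] - (pi[0] * t1 + pi[1] * t1 + pi[2] * t0),
           pi[2] - (pi[0] * t2 + pi[1] * t2 + pi[2] * t1 + pi[3] * t0)]
    for k in range(3, k_max + 1):
        res.append(pi[k] - (pi[k - 1] * t2 + pi[k] * t1 + pi[k + 1] * t0))
    return res


# Hypothetical probabilities with t0 > t2 (the positive persistent case).
t0, t1, t2 = 0.5, 0.3, 0.2
print(max(abs(r) for r in balance_residuals(t0, t1, t2, 40)))
print(sum(queue_stationary(t0, t2, k) for k in range(400)))  # truncated total mass
```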


If $t_0 = t_2$ in this queueing model, the chain is persistent (Example 8.11) but has no stationary distribution (Example 8.12). The next theorem describes the asymptotic behavior of the $p_{ij}^{(n)}$ in this case.

Theorem 8.7. If an irreducible, aperiodic chain has no stationary distribution, then

(8.36)    $\lim_n p_{ij}^{(n)} = 0$

for all $i$ and $j$.

If the chain is transient, (8.36) follows from Theorem 8.3. What is interesting here is the persistent case.


Proof. By the argument in the proof of Theorem 8.6, the coupled chain is irreducible. If it is transient, then $\sum_n (p_{ij}^{(n)})^2$ converges by Theorem 8.2, and the conclusion follows.

Suppose, on the other hand, that the coupled chain is (irreducible and) persistent. Then the stopping-time argument leading to (8.35) goes through as before. If the $p_{ij}^{(n)}$ do not all go to 0, then there is an increasing sequence $\{n_u\}$ of integers along which some $p_{ij}^{(n)}$ is bounded away from 0. By the diagonal method [A14], it is possible by passing to a subsequence of $\{n_u\}$ to ensure that each $p_{ij}^{(n_u)}$ converges to a limit, which by (8.35) must be independent of $i$. Therefore, there is a sequence $\{n_u\}$ such that $\lim_u p_{ij}^{(n_u)} = t_j$ exists for all $i$ and $j$, where $t_j$ is nonnegative for all $j$ and positive for some $j$. If $M$ is a finite set of states, then $\sum_{j \in M} t_j = \lim_u \sum_{j \in M} p_{ij}^{(n_u)} \le 1$, and hence $0 < t = \sum_j t_j \le 1$. Now $\sum_{k \in M} p_{ik}^{(n_u)} p_{kj} \le p_{ij}^{(n_u + 1)} = \sum_k p_{ik} p_{kj}^{(n_u)}$; it is possible to pass to the limit ($u \to \infty$) inside the first sum (if $M$ is finite) and inside the second sum (by the M-test), and hence $\sum_{k \in M} t_k p_{kj} \le \sum_k p_{ik} t_j = t_j$. Therefore, $\sum_k t_k p_{kj} \le t_j$; if one of these inequalities were strict, it would follow that $\sum_k t_k = \sum_j \sum_k t_k p_{kj} < \sum_j t_j$, which is impossible. Therefore $\sum_k t_k p_{kj} = t_j$ for all $j$, and the ratios $\pi_j = t_j / t$ give a stationary distribution, contrary to the hypothesis. ∎


The limits in (8.34) and (8.36) can be described in terms of mean return times. Let

(8.37)    $\mu_j = \sum_{n=1}^{\infty} n f_{jj}^{(n)};$

if the series diverges, write $\mu_j = \infty$. In the persistent case, this sum is to be thought of as the average number of steps to first return to $j$, given that $X_0 = j$.†

Lemma 3. Suppose that $j$ is persistent and that $\lim_n p_{jj}^{(n)} = u$. Then $u > 0$ if and only if $\mu_j < \infty$, in which case $u = 1/\mu_j$.

Under the convention that $0 = 1/\infty$, the case $u = 0$ and $\mu_j = \infty$ is consistent with the equation $u = 1/\mu_j$.


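Proof. For $k \ge 0$ let $\rho_k = \sum_{n > k} f_{jj}^{(n)}$; the notation does not show the dependence on $j$, which is fixed. Consider the double series

$f_{jj}^{(1)} + f_{jj}^{(2)} + f_{jj}^{(3)} + \cdots$
$\qquad\ \ + f_{jj}^{(2)} + f_{jj}^{(3)} + \cdots$
$\qquad\qquad\qquad + f_{jj}^{(3)} + \cdots$
$\qquad\qquad\qquad\qquad\quad + \cdots.$

The $k$th row sums to $\rho_k$ ($k \ge 0$) and the $n$th column sums to $n f_{jj}^{(n)}$ ($n \ge 1$), and so [A27] the series in (8.37) converges if and only if $\sum_k \rho_k$ does, in which case

(8.38)    $\mu_j = \sum_{k=0}^{\infty} \rho_k.$

Since $j$ is persistent, the $P_j$-probability that the system does not hit $j$ up to time $n$ is the probability that it hits $j$ after time $n$, and this is $\rho_n$. Therefore,

$1 - p_{jj}^{(n)} = P_j[X_n \ne j]$
$\qquad = P_j[X_1 \ne j, \ldots, X_n \ne j] + \sum_{k=1}^{n-1} P_j[X_k = j, X_{k+1} \ne j, \ldots, X_n \ne j]$
$\qquad = \rho_n + \sum_{k=1}^{n-1} p_{jj}^{(k)} \rho_{n-k},$

and since $\rho_0 = 1$,

$1 = \rho_0 p_{jj}^{(n)} + \rho_1 p_{jj}^{(n-1)} + \cdots + \rho_{n-1} p_{jj}^{(1)} + \rho_n p_{jj}^{(0)}.$

Keep only the first $k + 1$ terms on the right here, and let $n \to \infty$; the result is $1 \ge (\rho_0 + \cdots + \rho_k) u$. Therefore $u > 0$ implies that $\sum_k \rho_k$ converges, so that $\mu_j < \infty$.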

†Since in general there is no upper bound to the number of steps to first return, it is not a simple random variable. It does come under the general theory in Chapter 4, and its expected value is indeed $\mu_j$ (and (8.38) is just (5.29)), but for the present the interpretation of $\mu_j$ as an average is informal. See Problem 23.11.
Write $x_{nk} = \rho_k p_{jj}^{(n-k)}$ for $0 \le k \le n$ and $x_{nk} = 0$ for $n < k$. Then $0 \le x_{nk} \le \rho_k$ and $\lim_n x_{nk} = \rho_k u$. If $\mu_j < \infty$, then $\sum_k \rho_k$ converges and it follows by the M-test that $1 = \sum_{k=0}^{\infty} x_{nk} \to \sum_{k=0}^{\infty} \rho_k u$. By (8.38), $1 = \mu_j u$, so that $u > 0$ and $u = 1/\mu_j$. ∎


The law of large numbers bears on the relation $u = 1/\mu_j$ in the persistent case. Let $V_n$ be the number of visits to state $j$ up to time $n$. If the time from one visit to the next is about $\mu_j$, then $V_n$ should be about $n/\mu_j$: $V_n/n \approx 1/\mu_j$. But (if $X_0 = j$) $V_n/n$ has expected value $n^{-1} \sum_{k=1}^{n} p_{jj}^{(k)}$, which goes to $u$ under the hypothesis of Lemma 3 [A30].
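For a finite chain both sides of the relation $u = 1/\mu_j$ in Lemma 3 can be approximated numerically: $\mu_j$ by truncating the series (8.37), with the $f_{jj}^{(n)}$ obtained from the recursion $f_{ij}^{(n)} = \sum_{k \ne j} p_{ik} f_{kj}^{(n-1)}$, and $u$ by computing $p_{jj}^{(n)}$ for a large $n$. The matrix `P`, the truncation points, and the function names below are assumptions of this sketch.

```python
def mean_return_time(p, j, n_max=5000):
    """Truncated version of (8.37): sum of n * f_jj^(n) for n = 1..n_max."""
    size = len(p)
    f = [p[i][j] for i in range(size)]  # f_ij^(1) for each starting state i
    mu = f[j]
    for n in range(2, n_max + 1):
        # First visit to j at time n: avoid j at step 1, then first-visit in n - 1 steps.
        f = [sum(p[i][k] * f[k] for k in range(size) if k != j) for i in range(size)]
        mu += n * f[j]
    return mu


def return_probability(p, j, n=500):
    """p_jj^(n), used as an approximation to the limit u."""
    size = len(p)
    dist = [1.0 if k == j else 0.0 for k in range(size)]
    for _ in range(n):
        dist = [sum(dist[i] * p[i][k] for i in range(size)) for k in range(size)]
    return dist[j]


# Hypothetical irreducible, aperiodic chain.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.4, 0.6]]
for j in range(3):
    print(j, mean_return_time(P, j), 1.0 / return_probability(P, j))
```

The two printed columns should agree closely for each state; for this particular matrix the mean return time to state 1 works out to 2.65.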

Consider an irreducible, aperiodic, persistent chain. There are two possibilities. If there is a stationary distribution, then the limits (8.34) are positive, and the chain is called positive persistent. It then follows by Lemma 3 that $\mu_j < \infty$ and $\pi_j = 1/\mu_j$ for all $j$. In this case, it is not actually necessary to assume persistence, since this follows from the existence of a stationary distribution. On the other hand, if the chain has no stationary distribution, then the limits (8.36) are all 0, and the chain is called null persistent. It then follows by Lemma 3 that $\mu_j = \infty$ for all $j$. This, taken together with Theorem 8.3, provides a complete classification:


Theorem 8.8. For an irreducible, aperiodic chain there are three possibilities:

(i) The chain is transient; then for all $i$ and $j$, $\lim_n p_{ij}^{(n)} = 0$ and in fact $\sum_n p_{ij}^{(n)} < \infty$.

(ii) The chain is persistent but there exists no stationary distribution (the null persistent case); then for all $i$ and $j$, $p_{ij}^{(n)}$ goes to 0 but so slowly that $\sum_n p_{ij}^{(n)} = \infty$, and $\mu_j = \infty$.

(iii) There exist stationary probabilities $\pi_j$ and (hence) the chain is persistent (the positive persistent case); then for all $i$ and $j$, $\lim_n p_{ij}^{(n)} = \pi_j > 0$ and $\mu_j = 1/\pi_j < \infty$.

Since the asymptotic properties of the $p_{ij}^{(n)}$ are distinct in the three cases, these asymptotic properties in fact characterize the three cases.


Example 8.13. Suppose that the states are $0, 1, 2, \ldots$ and the transition matrix is

$P = \begin{bmatrix} q_0 & p_0 & 0 & 0 & \cdots \\ q_1 & 0 & p_1 & 0 & \cdots \\ q_2 & 0 & 0 & p_2 & \cdots \\ \vdots & & & & \ddots \end{bmatrix},$

where $p_i$ and $q_i$ are positive. The state $i$ represents the length of a success run, the conditional chance of a further success being $p_i$. Clearly the chain is irreducible and aperiodic.

A solution of the system (8.27) for testing for transience (with $i_0 = 0$) must have the form $x_k = x_1 / (p_1 \cdots p_{k-1})$. Hence there is a bounded, nontrivial solution, and the chain is transient, if and only if the limit $\alpha$ of $p_0 \cdots p_n$ is positive. But the chance of no return to 0 (for initial state 0) in $n$ steps is clearly $p_0 \cdots p_{n-1}$; hence $f_{00} = 1 - \alpha$, which checks: the chain is persistent if and only if $\alpha = 0$.

Every solution of the steady-state equations (8.30) has the form $\pi_k = \pi_0 p_0 \cdots p_{k-1}$. Hence there is a stationary distribution if and only if $\sum_k p_0 \cdots p_k$ converges; this is the positive persistent case. The null persistent case is that in which $p_0 \cdots p_k \to 0$ but $\sum_k p_0 \cdots p_k$ diverges (which happens, for example, if $q_k = 1/k$ for $k > 1$).

Since the chance of no return to 0 in $n$ steps is $p_0 \cdots p_{n-1}$, in the persistent case (8.38) gives $\mu_0 = \sum_{k=0}^{\infty} p_0 \cdots p_{k-1}$. In the null persistent case this checks with $\mu_0 = \infty$; in the positive persistent case it gives $\mu_0 = \sum_{k=0}^{\infty} \pi_k / \pi_0 = 1/\pi_0$, which again is consistent. ∎
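The null persistent instance mentioned in this example can be worked out in one line. As a sketch, assume $q_k = 1/k$ for $k \ge 2$, with $p_0$ and $p_1$ arbitrary in $(0, 1)$ (the values for $k \le 1$ are not specified in the text); then the product telescopes:

```latex
% Assumption for illustration: q_k = 1/k for k >= 2; p_0, p_1 arbitrary in (0,1).
\[
p_0 p_1 \cdots p_k
  = p_0 p_1 \prod_{m=2}^{k} \Bigl(1 - \frac{1}{m}\Bigr)
  = p_0 p_1 \prod_{m=2}^{k} \frac{m-1}{m}
  = \frac{p_0 p_1}{k} \longrightarrow 0,
\qquad
\sum_{k \ge 2} p_0 p_1 \cdots p_k
  = p_0 p_1 \sum_{k \ge 2} \frac{1}{k} = \infty .
\]
```

So $\alpha = 0$ (the chain is persistent) while the series defining the stationary distribution diverges, which is exactly the null persistent case.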


Example 8.14. Since $\sum_j p_{ij}^{(n)} = 1$, possibilities (i) and (ii) in Theorem 8.8 are impossible in the finite case: A finite, irreducible, aperiodic Markov chain has a stationary distribution. ∎


Exponential Convergence* 


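In the finite case, $p_{ij}^{(n)}$ converges to $\pi_j$ at an exponential rate:

Theorem 8.9. If the state space is finite and the chain is irreducible and aperiodic, then there is a stationary distribution $\{\pi_i\}$, and

$|p_{ij}^{(n)} - \pi_j| \le A \rho^n,$

where $A \ge 0$ and $0 \le \rho < 1$.†

Proof. Let $m_j^{(n)} = \min_i p_{ij}^{(n)}$ and $M_j^{(n)} = \max_i p_{ij}^{(n)}$. By the relation $p_{ij}^{(n+1)} = \sum_{\nu} p_{i\nu} p_{\nu j}^{(n)}$, each $p_{ij}^{(n+1)}$ is an average of the $p_{\nu j}^{(n)}$, so that $m_j^{(n+1)} \ge m_j^{(n)}$ and $M_j^{(n+1)} \le M_j^{(n)}$.

* This topic may be omitted.
†For other proofs, see Problems 8.18 and 8.27.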
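Since obviously $m_j^{(n)} \le M_j^{(n)}$,

(8.39)    $0 \le m_j^{(1)} \le m_j^{(2)} \le \cdots \le M_j^{(2)} \le M_j^{(1)} \le 1.$

Suppose temporarily that all the $p_{ij}$ are positive. Let $s$ be the number of states and let $\delta = \min_{ij} p_{ij}$. From $\sum_j p_{ij} \ge s\delta$ follows $0 < \delta \le s^{-1}$. Fix states $u$ and $v$ for the moment; let $\sum'$ denote the summation over $j$ in $S$ satisfying $p_{uj} \ge p_{vj}$ and let $\sum''$ denote summation over $j$ satisfying $p_{uj} < p_{vj}$. Then

(8.40)    $\sum' (p_{uj} - p_{vj}) + \sum'' (p_{uj} - p_{vj}) = 1 - 1 = 0.$

Since $\sum' p_{vj} + \sum'' p_{uj} \ge s\delta$,

(8.41)    $\sum' (p_{uj} - p_{vj}) = 1 - \sum'' p_{uj} - \sum' p_{vj} \le 1 - s\delta.$

Apply (8.40) and then (8.41):

$p_{uk}^{(n+1)} - p_{vk}^{(n+1)} = \sum_j (p_{uj} - p_{vj}) p_{jk}^{(n)}$
$\qquad \le \sum' (p_{uj} - p_{vj}) M_k^{(n)} + \sum'' (p_{uj} - p_{vj}) m_k^{(n)}$
$\qquad = \sum' (p_{uj} - p_{vj}) (M_k^{(n)} - m_k^{(n)})$
$\qquad \le (1 - s\delta)(M_k^{(n)} - m_k^{(n)}).$

Since $u$ and $v$ are arbitrary,

$M_k^{(n+1)} - m_k^{(n+1)} \le (1 - s\delta)(M_k^{(n)} - m_k^{(n)}).$

Therefore, $M_k^{(n)} - m_k^{(n)} \le (1 - s\delta)^n$. It follows by (8.39) that $m_j^{(n)}$ and $M_j^{(n)}$ have a common limit $\pi_j$ and that

(8.42)    $|p_{ij}^{(n)} - \pi_j| \le (1 - s\delta)^n.$

Take $A = 1$ and $\rho = 1 - s\delta$. Passing to the limit in $\sum_i p_{ui}^{(n)} p_{ij} = p_{uj}^{(n+1)}$ shows that the $\pi_j$ are stationary probabilities. (Note that the proof thus makes almost no use of the preceding theory.)

If the $p_{ij}$ are not all positive, apply Lemma 2: Since there are only finitely many states, there exists an $m$ such that $p_{ij}^{(m)} > 0$ for all $i$ and $j$. By the case just treated, $M_j^{(mt)} - m_j^{(mt)} \le \rho^t$. Take $A = \rho^{-1}$ and then replace $\rho$ by $\rho^{1/m}$. ∎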
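The contraction at the heart of this proof, $M_k^{(n)} - m_k^{(n)} \le (1 - s\delta)^n$ when all $p_{ij}$ are positive, can be observed numerically. The sketch below computes the column spreads of $P^n$ for a hypothetical strictly positive $3 \times 3$ matrix (an assumption for illustration) and sets them beside the bound.

```python
def column_spreads(p, n_max):
    """Entry n - 1 of the result is max over j of M_j^(n) - m_j^(n), n = 1..n_max."""
    size = len(p)
    power = [row[:] for row in p]  # P^1
    spreads = []
    for _ in range(n_max):
        spreads.append(max(
            max(power[i][j] for i in range(size)) - min(power[i][j] for i in range(size))
            for j in range(size)))
        power = [[sum(power[i][k] * p[k][j] for k in range(size)) for j in range(size)]
                 for i in range(size)]
    return spreads


# Hypothetical transition matrix with all entries positive.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
s = len(P)
delta = min(min(row) for row in P)
rho = 1 - s * delta  # 0.4 for this matrix, up to rounding

for n, spread in enumerate(column_spreads(P, 10), start=1):
    print(n, spread, rho ** n)
```

In practice the spreads usually shrink faster than $\rho^n$; the theorem only asserts the bound, not that it is sharp.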
Example 8.15. Suppose that

$P = \begin{bmatrix} p_0 & p_1 & \cdots & p_{s-1} \\ p_{s-1} & p_0 & \cdots & p_{s-2} \\ \vdots & & & \vdots \\ p_1 & p_2 & \cdots & p_0 \end{bmatrix}.$

The rows of $P$ are the cyclic permutations of the first row: $p_{ij} = p_{j-i}$, $j - i$ reduced modulo $s$. Since the columns of $P$ add to 1 as well as the rows, the steady-state equations (8.30) have the solution $\pi_i \equiv s^{-1}$. If the $p_i$ are all positive, the theorem implies that $p_{ij}^{(n)}$ converges to $s^{-1}$ at an exponential rate. If $X_0, Y_1, Y_2, \ldots$ are independent random variables with range $\{0, 1, \ldots, s - 1\}$, if each $Y_n$ has distribution $\{p_0, \ldots, p_{s-1}\}$, and if $X_n = X_0 + Y_1 + \cdots + Y_n$, where the sum is reduced modulo $s$, then $P[X_n = j] \to s^{-1}$. The $X_n$ describe a random walk on a circle of points, and whatever the initial distribution, the positions become equally likely in the limit. ∎
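The random walk on a circle can be followed step by step by convolving the current distribution with the step distribution modulo $s$. This is a minimal sketch; the step probabilities and the starting point are assumptions chosen for illustration.

```python
def circle_walk_distribution(start, step_probs, n):
    """Distribution of X_n = X_0 + Y_1 + ... + Y_n (mod s) on a circle of s points."""
    s = len(step_probs)
    dist = start[:]
    for _ in range(n):
        # P[X_{m+1} = j] = sum over y of P[X_m = j - y (mod s)] * p_y
        dist = [sum(dist[(j - y) % s] * step_probs[y] for y in range(s))
                for j in range(s)]
    return dist


# Hypothetical step distribution on a circle of s = 5 points, all p_i positive.
steps = [0.4, 0.3, 0.1, 0.1, 0.1]
start = [1.0, 0.0, 0.0, 0.0, 0.0]  # X_0 = 0 with probability 1

for n in (1, 5, 20, 100):
    print(n, circle_walk_distribution(start, steps, n))
```

By $n = 100$ every coordinate should be indistinguishable from $1/5$ in floating point.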


Optimal Stopping* 


Assume throughout the rest of the section that $S$ is finite. Consider a function $\tau$ on $\Omega$ for which $\tau(\omega)$ is a nonnegative integer for each $\omega$. Let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$; $\tau$ is a stopping time or a Markov time if

(8.43)    $[\omega : \tau(\omega) = n] \in \mathcal{F}_n$

for $n = 0, 1, \ldots$. This is analogous to the condition (7.18) on the gambler's stopping time. It will be necessary to allow $\tau(\omega)$ to assume the special value $\infty$, but only on a set of probability 0. This has no effect on the requirement (8.43), which concerns finite $n$ only.

If $f$ is a real function on the state space, then $f(X_0), f(X_1), \ldots$ are simple random variables. Imagine an observer who follows the successive states $X_0, X_1, \ldots$ of the system. He stops at time $\tau$, when the state is $X_\tau$ (or $X_{\tau(\omega)}(\omega)$), and receives a reward or payoff $f(X_\tau)$. The condition (8.43) prevents prevision on the part of the observer. This is a kind of game, the stopping time is a strategy, and the problem is to find a strategy that maximizes the expected payoff $E[f(X_\tau)]$. The problem in Example 8.5 had this form; there $S = \{1, 2, \ldots, r + 1\}$, and the payoff function is $f(i) = i/r$ for $i \le r$ (set $f(r + 1) = 0$).

If $P(A) > 0$ and $Y = \sum_j y_j I_{B_j}$ is a simple random variable, the $B_j$ forming a finite decomposition of $\Omega$ into $\mathcal{F}$-sets, the conditional expected value

*This topic may be omitted. 
given $A$ is defined by

$E[Y \mid A] = \sum_j y_j P(B_j \mid A).$

Denote by $E_i$ conditional expected values for the case $A = [X_0 = i]$:

$E_i[Y] = E[Y \mid X_0 = i] = \sum_j y_j P_i(B_j).$

The stopping-time problem is to choose $\tau$ so as to maximize simultaneously $E_i[f(X_\tau)]$ for all initial states $i$. If $x$ lies in the range of $f$, which is finite, and if $\tau$ is everywhere finite, then $[\omega : f(X_{\tau(\omega)}(\omega)) = x] = \bigcup_{n=0}^{\infty} [\omega : \tau(\omega) = n,\ f(X_n(\omega)) = x]$ lies in $\mathcal{F}$, and so $f(X_\tau)$ is a simple random variable. In order that this always hold, put $f(X_{\tau(\omega)}(\omega)) = 0$, say, if $\tau(\omega) = \infty$ (which happens only on a set of probability 0).

The game with payoff function $f$ has at $i$ the value

(8.44)    $v(i) = \sup E_i[f(X_\tau)],$

the supremum extending over all Markov times $\tau$. It will turn out that the supremum here is achieved: there always exists an optimal stopping time. It will also turn out that there is an optimal $\tau$ that works for all initial states $i$. The problem is to calculate $v(i)$ and find the best $\tau$. If the chain is irreducible, the system must pass through every state, and the best strategy is obviously to wait until the system enters a state for which $f$ is maximal. This describes an optimal $\tau$, and $v(i) = \max f$ for all $i$. For this reason the interesting cases are those in which some states are transient and others are absorbing ($p_{ii} = 1$).

A function $\varphi$ on $S$ is excessive or superharmonic, if†

(8.45)    $\varphi(i) \ge \sum_j p_{ij} \varphi(j), \qquad i \in S.$

In terms of conditional expectation the requirement is $\varphi(i) \ge E_i[\varphi(X_1)]$.
Lemma 4. The value function v is excessive. 


Proof. Given $\epsilon$, choose for each $j$ in $S$ a "good" stopping time $\tau_j$ satisfying $E_j[f(X_{\tau_j})] > v(j) - \epsilon$. By (8.43), $[\tau_j = n] = [(X_0, \ldots, X_n) \in I_{jn}]$ for some set $I_{jn}$ of $(n + 1)$-long sequences of states. Set $\tau = n + 1$ ($n \ge 0$) on the set $[X_1 = j] \cap [(X_1, \ldots, X_{n+1}) \in I_{jn}]$; that is, take one step and then from the new state $X_1$ add on the "good" stopping time for that state. Then $\tau$ is a

‘Compare the conditions (7.28) and (7.35). 
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stopping time and 
EL f(X,)] = Èa u ur [X 2j (Xi. 70 Xy41) El jn? X41 =k] f(k) 
j K 


LP, PI Xos » Ay) GL jn? X, =k] f(k) 


| Il 
PE itis 
= a 

= 


Therefore, v(@i) > EI f(X,)] = Lp) — €) = LDV) —e. Since e was 
arbitrary, v is excessive. a 


Lemma 5. Suppose that $\varphi$ is excessive.


(i) For all stopping times $\tau$, $\varphi(i) \ge E_i[\varphi(X_\tau)]$.
(ii) For all pairs of stopping times satisfying $\sigma \le \tau$, $E_i[\varphi(X_\sigma)] \ge E_i[\varphi(X_\tau)]$.


Part (i) says that for an excessive payoff function, $\tau \equiv 0$ represents an
optimal strategy.


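PROOF. To prove (i), put $\tau_N = \min\{\tau, N\}$. Then $\tau_N$ is a stopping time,
and


(8.46) $E_i[\varphi(X_{\tau_N})] = \sum_{n=0}^{N-1} \sum_k P_i[\tau = n, X_n = k]\varphi(k) + \sum_k P_i[\tau \ge N, X_N = k]\varphi(k).$


Since $[\tau \ge N] = [\tau < N]^c \in \mathcal{F}_{N-1}$, the final sum here is by (8.13)


$\sum_k \sum_j P_i[\tau \ge N, X_{N-1} = j, X_N = k]\varphi(k)$
$= \sum_k \sum_j P_i[\tau \ge N, X_{N-1} = j] p_{jk}\varphi(k) \le \sum_j P_i[\tau \ge N, X_{N-1} = j]\varphi(j).$


Substituting this into (8.46) leads to $E_i[\varphi(X_{\tau_N})] \le E_i[\varphi(X_{\tau_{N-1}})]$. Since $\tau_0 = 0$
and $E_i[\varphi(X_0)] = \varphi(i)$, it follows that $E_i[\varphi(X_{\tau_N})] \le \varphi(i)$ for all $N$. But for
$\tau(\omega)$ finite, $\varphi(X_{\tau_N(\omega)}(\omega)) \to \varphi(X_{\tau(\omega)}(\omega))$ (there is equality for large $N$), and so
$E_i[\varphi(X_{\tau_N})] \to E_i[\varphi(X_\tau)]$ by Theorem 5.4.

The proof of (ii) is essentially the same. If $\tau_N = \min\{\tau, \sigma + N\}$, then $\tau_N$ is a
stopping time, and


$E_i[\varphi(X_{\tau_N})] = \sum_{m=0}^\infty \sum_{n=0}^{N-1} \sum_k P_i[\sigma = m, \tau = m + n, X_{m+n} = k]\varphi(k)$
$+ \sum_{m=0}^\infty \sum_k P_i[\sigma = m, \tau \ge m + N, X_{m+N} = k]\varphi(k).$


Since $[\sigma = m, \tau \ge m + N] = [\sigma = m] - [\sigma = m, \tau < m + N] \in \mathcal{F}_{m+N-1}$, again
$E_i[\varphi(X_{\tau_N})] \le E_i[\varphi(X_{\tau_{N-1}})] \le E_i[\varphi(X_{\tau_0})]$. Since $\tau_0 = \sigma$, part (ii) follows from
part (i) by another passage to the limit. ■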


Lemma 6. If an excessive function $\varphi$ dominates the payoff function $f$, then
it dominates the value function $v$ as well.


By definition, to say that $g$ dominates $h$ is to say that $g(i) \ge h(i)$ for all $i$.


PROOF. By Lemma 5, $\varphi(i) \ge E_i[\varphi(X_\tau)] \ge E_i[f(X_\tau)]$ for all Markov times
$\tau$, and so $\varphi(i) \ge v(i)$ for all $i$. ■


Since $\tau \equiv 0$ is a stopping time, $v$ dominates $f$. Lemmas 4 and 6 immediately
characterize $v$:


Theorem 8.10. The value function $v$ is the minimal excessive function
dominating $f$.


There remains the problem of constructing the optimal strategy $\tau$. Let $M$
be the set of states $i$ for which $v(i) = f(i)$; $M$, the support set, is nonempty,
since it at least contains those $i$ that maximize $f$. Let $A = \bigcap_{n=0}^\infty [X_n \notin M]$ be
the event that the system never enters $M$. The following argument shows that
$P_i(A) = 0$ for each $i$. As this is trivial if $M = S$, assume that $M \ne S$. Choose
$\delta > 0$ so that $f(i) \le v(i) - \delta$ for $i \in S - M$. Now
$E_i[f(X_\tau)] = \sum_{n=0}^\infty \sum_k P_i[\tau = n, X_n = k] f(k)$; replacing the $f(k)$ by $v(k)$ or
$v(k) - \delta$ according as $k \in M$ or $k \in S - M$ gives
$E_i[f(X_\tau)] \le E_i[v(X_\tau)] - \delta P_i[X_\tau \in S - M] \le E_i[v(X_\tau)] - \delta P_i(A) \le v(i) - \delta P_i(A)$,
the last inequality by Lemmas 4 and 5. Since this
holds for every Markov time, taking the supremum over $\tau$ gives $P_i(A) = 0$.
Whatever the initial state, the system is thus certain to enter the support
set $M$.

Let $\tau_0(\omega) = \min[n: X_n(\omega) \in M]$ be the hitting time for $M$. Then $\tau_0$ is a
Markov time, and $\tau_0 = 0$ if $X_0 \in M$. It may be that $X_n(\omega) \notin M$ for all $n$, in
which case $\tau_0(\omega) = \infty$, but as just shown, the probability of this is 0.

Theorem 8.11. The hitting time $\tau_0$ is optimal: $E_i[f(X_{\tau_0})] = v(i)$ for all $i$.


PROOF. By the definition of $\tau_0$, $f(X_{\tau_0}) = v(X_{\tau_0})$. Put $\varphi(i) = E_i[f(X_{\tau_0})] = E_i[v(X_{\tau_0})]$.
The first step is to show that $\varphi$ is excessive. If $\tau_1 = \min[n: n \ge 1, X_n \in M]$,
then $\tau_1$ is a Markov time and


$E_i[v(X_{\tau_1})] = \sum_{n=1}^\infty \sum_{k \in M} P_i[X_1 \notin M, \ldots, X_{n-1} \notin M, X_n = k] v(k)$
$= \sum_{n=1}^\infty \sum_{k \in M} \sum_{j \in S} p_{ij} P_j[X_0 \notin M, \ldots, X_{n-2} \notin M, X_{n-1} = k] v(k)$
$= \sum_j p_{ij} E_j[v(X_{\tau_0})].$


Since $\tau_0 \le \tau_1$, $E_i[v(X_{\tau_0})] \ge E_i[v(X_{\tau_1})]$ by Lemmas 4 and 5.

This shows that $\varphi$ is excessive. And $\varphi(i) \le v(i)$ by the definition (8.44). If
$\varphi(i) \ge f(i)$ is proved, it will follow by Theorem 8.10 that $\varphi(i) \ge v(i)$ and
hence that $\varphi(i) = v(i)$. Since $\tau_0 = 0$ for $X_0 \in M$, if $i \in M$ then
$\varphi(i) = E_i[f(X_0)] = f(i)$. Suppose that $\varphi(i) < f(i)$ for some values of $i$ in $S - M$, and
choose $i_0$ to maximize $f(i) - \varphi(i)$. Then $\psi(i) = \varphi(i) + f(i_0) - \varphi(i_0)$ dominates
$f$ and is excessive, being the sum of a constant and an excessive function. By
Theorem 8.10, $\psi$ must dominate $v$, so that $\psi(i_0) \ge v(i_0)$, or $f(i_0) \ge v(i_0)$. But
this implies that $i_0 \in M$, a contradiction. ■
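As a numerical companion to Theorems 8.10 and 8.11, the following Python sketch approximates the value function on a small chain and checks that stopping on first entry to the support set attains it. The four-state chain, the payoff, the names `value_function` and `hitting_payoff`, and the iteration $v \leftarrow \max\{f, Pv\}$ are all assumptions made for illustration; the text argues through excessive functions, not by iteration.

```python
def value_function(P, f, tol=1e-12, max_iter=100_000):
    """Approximate the minimal excessive function dominating f.

    Starts from v = f and repeats v <- max(f, P v). The iterates increase
    toward the least excessive majorant; this scheme is an assumption of
    the sketch, not the construction used in the text.
    """
    n = len(f)
    v = list(f)
    for _ in range(max_iter):
        Pv = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        new = [max(f[i], Pv[i]) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def hitting_payoff(P, f, M, tol=1e-12, max_iter=100_000):
    """Approximate E_i[f(X_tau0)], tau0 being the hitting time of the set M."""
    n = len(f)
    phi = [f[i] if i in M else 0.0 for i in range(n)]
    for _ in range(max_iter):
        new = [f[i] if i in M else sum(P[i][j] * phi[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, phi)) < tol:
            return new
        phi = new
    return phi


# Hypothetical four-state chain: states 0 and 3 are absorbing.
P = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.3, 0.0, 0.7],
    [0.0, 0.0, 0.0, 1.0],
]
f = [1.0, 0.0, 2.0, 3.0]

v = value_function(P, f)
M = {i for i in range(len(f)) if abs(v[i] - f[i]) < 1e-9}  # support set
phi = hitting_payoff(P, f, M)
print("v   =", v)
print("M   =", M)
print("phi =", phi)
```

Starting the iteration from $f$ makes the iterates increase, and any excessive function dominating $f$ dominates every iterate, which is why the limit is the minimal one. In this chain the support set consists of the two absorbing states, and the hitting-time payoff agrees with $v$ up to the tolerance.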


The optimal strategy need not be unique. If f is constant, for example, all 
strategies have the same value. 


Example 8.16. For the symmetric random walk with absorbing barriers at
0 and $r$ (Example 8.2) a function $\varphi$ on $S = \{0, 1, \ldots, r\}$ is excessive if
$\varphi(i) \ge \frac{1}{2}\varphi(i - 1) + \frac{1}{2}\varphi(i + 1)$ for $1 \le i \le r - 1$. The requirement is that $\varphi$ give
a concave function when extended by linear interpolation from $S$ to the
entire interval $[0, r]$. Hence $v$ thus extended is the minimal concave function
dominating $f$. The figure shows the geometry: the ordinates of the dots are
the values of $f$ and the polygonal line describes $v$. The optimal strategy is to
stop at a state for which the dot lies on the polygon.

If $f(r) = 1$ and $f(i) = 0$ for $i < r$, then $v$ is a straight line; $v(i) = i/r$. The
optimal Markov time $\tau_0$ is the hitting time for $M = \{0, r\}$, and
$v(i) = E_i[f(X_{\tau_0})]$ is the probability of absorption in the state $r$. This gives another
solution of the gambler’s ruin problem for the symmetric case. ■
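The geometric description in Example 8.16 can be tried out directly. The sketch below computes the least concave majorant of the points $(i, f(i))$ with an upper-hull scan and exact rational arithmetic; the function name `concave_majorant` and the sample payoffs are assumptions chosen for illustration.

```python
from fractions import Fraction


def concave_majorant(f):
    """Least concave majorant of the points (i, f[i]), returned on 0..r."""
    pts = [(i, Fraction(y)) for i, y in enumerate(f)]
    hull = []  # vertices of the upper hull, left to right
    for x, y in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the middle vertex if it lies on or below the chord to (x, y).
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    v = [hull[0][1]] * len(f)
    # Interpolate linearly between consecutive hull vertices.
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        for i in range(x1, x2 + 1):
            v[i] = y1 + (y2 - y1) * Fraction(i - x1, x2 - x1)
    return v


# Payoff 1 at r = 4 and 0 elsewhere: the majorant is the line i / r.
line = concave_majorant([0, 0, 0, 0, 1])

# A hypothetical payoff with an interior dip at state 2.
f = [0, 2, 1, 4, 0]
v = concave_majorant(f)
support = [i for i in range(len(f)) if v[i] == f[i]]
print(line, v, support)
```

For the second payoff the dot at state 2 lies below the polygon, so the walk should continue from there and stop at any of the other states.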


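Example 8.17. For the selection problem in Example 8.5, the $p_{ij}$ are given by (8.5)
and (8.6) for $1 \le i \le r$, while $p_{r+1,r+1} = 1$. The payoff is $f(i) = i/r$ for $i \le r$ and
$f(r + 1) = 0$. Thus $v(r + 1) = 0$, and since $v$ is excessive,


(8.47) $v(i) \ge g(i) = \sum_{j=i+1}^{r} \frac{i}{j(j-1)} v(j), \quad 1 \le i < r.$


By Theorem 8.10, $v$ is the smallest function satisfying (8.47) and $v(i) \ge f(i) = i/r$,
$1 \le i \le r$. Since (8.47) puts no lower limit on $v(r)$, it follows that $v(r) = f(r) = 1$, and $r$
lies in the support set $M$. By minimality,


(8.48) $v(i) = \max\{f(i), g(i)\}, \quad 1 \le i < r.$


If $i \in M$, then $f(i) = v(i) \ge g(i) \ge \sum_{j=i+1}^{r} i j^{-1}(j - 1)^{-1} f(j) = f(i) \sum_{j=i+1}^{r} (j - 1)^{-1}$, and
hence $\sum_{j=i+1}^{r} (j - 1)^{-1} \le 1$. On the other hand, if this inequality holds and $i + 1, \ldots, r$
all lie in $M$, then $g(i) = \sum_{j=i+1}^{r} i j^{-1}(j - 1)^{-1} f(j) = f(i) \sum_{j=i+1}^{r} (j - 1)^{-1} \le f(i)$, so that
$i \in M$ by (8.48). Therefore, $M = \{i_r, i_r + 1, \ldots, r, r + 1\}$, where $i_r$ is determined by


(8.49) $\frac{1}{i_r} + \frac{1}{i_r + 1} + \cdots + \frac{1}{r - 1} \le 1 < \frac{1}{i_r - 1} + \frac{1}{i_r} + \cdots + \frac{1}{r - 1}.$


If $i < i_r$, so that $i \notin M$, then $v(i) > f(i)$ and so, by (8.48),


$v(i) = g(i) = \sum_{j=i+1}^{i_r - 1} \frac{i}{j(j-1)} v(j) + \sum_{j=i_r}^{r} \frac{i}{j(j-1)} f(j)$
$= \sum_{j=i+1}^{i_r - 1} \frac{i}{j(j-1)} v(j) + \frac{i}{r}\left(\frac{1}{i_r - 1} + \cdots + \frac{1}{r - 1}\right).$


It follows by backward induction starting with $i = i_r - 1$ that


(8.50) $v(i) = p_r = \frac{i_r - 1}{r}\left(\frac{1}{i_r - 1} + \cdots + \frac{1}{r - 1}\right)$


is constant for $1 \le i < i_r$.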

In the selection problem as originally posed, $X_1 = 1$. The optimal strategy is to
stop with the first $X_n$ that lies in $M$. The princess should therefore reject the first
$i_r - 1$ suitors and accept the next one who is preferable to all his predecessors (is
dominant). The probability of success is $p_r$ as given by (8.50). Failure can happen in
two ways. Perhaps the first dominant suitor after $i_r$ is not the best of all suitors; in this
case the princess will be unaware of failure. Perhaps no dominant suitor comes after
$i_r$; in this case the princess is obliged to take the last suitor of all and may be well
aware of failure. Recall that the problem was to maximize the chance of getting the
best suitor of all rather than, say, the chance of getting a suitor in the top half.

If $r$ is large, (8.49) essentially requires that $\log r - \log i_r$ be near 1, so that $i_r \approx r/e$.
In this case, $p_r \approx 1/e$.

Note that although the system starts in state 1 in the original problem, its
resolution by means of the preceding theory requires consideration of all possible
initial states. ■
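The threshold in (8.49) and the success probability in (8.50) are easy to tabulate. The following sketch does so with exact fractions; the function names `threshold` and `success_probability` are hypothetical, and the code assumes $r \ge 3$ so that $i_r \ge 2$ and every term in (8.49) is defined.

```python
import math
from fractions import Fraction


def threshold(r):
    """i_r of (8.49): the smallest i with 1/i + ... + 1/(r-1) <= 1 (r >= 3)."""
    tail = Fraction(0)  # the sum 1/i + ... + 1/(r-1), empty for i = r
    i = r
    while i > 2 and tail + Fraction(1, i - 1) <= 1:
        tail += Fraction(1, i - 1)
        i -= 1
    return i


def success_probability(r):
    """p_r of (8.50)."""
    i_r = threshold(r)
    harmonic_part = sum(Fraction(1, k) for k in range(i_r - 1, r))
    return Fraction(i_r - 1, r) * harmonic_part


for r in (4, 10, 100, 1000):
    i_r = threshold(r)
    print(r, i_r, float(success_probability(r)), i_r / r, 1 / math.e)
```

For growing $r$ the printed ratio $i_r/r$ and the probability $p_r$ can be compared against $1/e$ in the last column.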


This theory carries over in part to the case of infinite $S$, although this
requires the general theory of expected values, since $f(X_\tau)$ may not be a
simple random variable. Theorem 8.10 holds for infinite $S$ if the payoff
function is nonnegative and the value function is finite.† But then problems
arise: Optimal strategies may not exist, and the probability of hitting the
support set $M$ may be less than 1. Even if this probability is 1, the strategy of
stopping on first entering $M$ may be the worst one of all.‡

†The only essential change in the argument is that Fatou’s lemma (Theorem 16.3) must be
used in place of Theorem 5.4 in the proof of Lemma 5.
‡See Problems 8.36 and 8.37.


PROBLEMS


8.1. Prove Theorem 8.1 for the case of finite $S$ by constructing the appropriate
probability measure on sequence space $S^\infty$: Replace the summand on the right
in (2.21) by $\alpha_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n}$, and extend the arguments preceding Theorem
2.3. If $X_n(\cdot) = z_n(\cdot)$, then $X_1, X_2, \ldots$ is the appropriate Markov chain (here
time is shifted by 1).


8.2. Let $Y_0, Y_1, \ldots$ be independent and identically distributed with $P[Y_n = 1] = p$,
$P[Y_n = 0] = q = 1 - p$, $p \ne q$. Put $X_n = Y_n + Y_{n+1}$ (mod 2). Show that $X_0, X_1, \ldots$
is not a Markov chain even though $P[X_{n+1} = j | X_{n-1} = i] = P[X_{n+1} = j]$. Does
this last relation hold for all Markov chains? Why?


8.3. Show by example that a function $f(X_0), f(X_1), \ldots$ of a Markov chain need not
be a Markov chain.


8.4. Show that


$f_{ij} \sum_{k=0}^\infty p_{jj}^{(k)} = \sum_{n=1}^\infty \sum_{m=1}^{n} f_{ij}^{(m)} p_{jj}^{(n-m)} = \sum_{n=1}^\infty p_{ij}^{(n)},$


and prove that if $j$ is transient, then $\sum_n p_{ij}^{(n)} < \infty$ for each $i$ (compare Theorem
8.3(i)). If $j$ is transient, then


$f_{ij} = \sum_{n=1}^\infty p_{ij}^{(n)} \Big/ \left(1 + \sum_{n=1}^\infty p_{jj}^{(n)}\right).$


Specialize to the case $i = j$: in addition to implying that $i$ is transient (Theorem
8.2(i)), a finite value for $\sum_{n=1}^\infty p_{ii}^{(n)}$ suffices to determine $f_{ii}$ exactly.


8.5. Call $\{x_i\}$ a subsolution of (8.24) if $x_i \le \sum_j q_{ij} x_j$ and $0 \le x_i \le 1$, $i \in U$. Extending
Lemma 1, show that a subsolution $\{x_i\}$ satisfies $x_i \le \sigma_i$: The solution $\{\sigma_i\}$ of
(8.24) dominates all subsolutions as well as all solutions. Show that if $x_i = \sum_j q_{ij} x_j$
and $-1 \le x_i \le 1$, then $\{|x_i|\}$ is a subsolution of (8.24).


8.6. Show by solving (8.27) that the unrestricted random walk on the line (Example
8.3) is persistent if and only if $p = \frac{1}{2}$.


8.7. (a) Generalize an argument in the proof of Theorem 8.5 to show that
$f_{ik} = p_{ik} + \sum_{j \ne k} p_{ij} f_{jk}$. Generalize this further to


$f_{ik} = f_{ik}^{(1)} + \cdots + f_{ik}^{(n)} + \sum_{j \ne k} P_i[X_1 \ne k, \ldots, X_{n-1} \ne k, X_n = j] f_{jk}.$


(b) Take $k = i$. Show that $f_{ij} > 0$ if and only if $P_i[X_1 \ne i, \ldots, X_{n-1} \ne i, X_n = j] > 0$
for some $n$, and conclude that $i$ is transient if and only if $f_{ji} < 1$ for
some $j \ne i$ such that $f_{ij} > 0$.

(c) Show that an irreducible chain is transient if and only if for each $i$ there is a
$j \ne i$ such that $f_{ji} < 1$.


8.8. Suppose that $S = \{0, 1, 2, \ldots\}$, $p_{00} = 1$, and $f_{i0} > 0$ for all $i$.
(a) Show that $P_i(\bigcup_{j=1}^\infty [X_n = j \text{ i.o.}]) = 0$ for all $i$.
(b) Regard the state as the size of a population and interpret the conditions
$p_{00} = 1$ and $f_{i0} > 0$ and the conclusion in part (a).


8.9. 8.5↑ Show for an irreducible chain that (8.27) has a nontrivial solution if and
only if there exists a nontrivial, bounded sequence $\{x_i\}$ (not necessarily nonnegative)
satisfying $x_i = \sum_{j \ne i_0} p_{ij} x_j$, $i \ne i_0$. (See the remark following the proof of
Theorem 8.5.)


8.10. ↑ Show that an irreducible chain is transient if and only if (for arbitrary $i_0$)
the system $y_i = \sum_j p_{ij} y_j$, $i \ne i_0$ (sum over all $j$), has a bounded, nonconstant
solution $\{y_i, i \in S\}$.


8.11. Show that the $P_i$-probabilities of ever leaving $U$ for $i \in U$ are the minimal
solution of the system


(8.51) $z_i = \sum_{j \in U} p_{ij} z_j + \sum_{j \notin U} p_{ij}, \quad i \in U; \qquad 0 \le z_i \le 1, \quad i \in U.$


The constraint $z_i \le 1$ can be dropped: the minimal solution automatically
satisfies it, since $z_i \equiv 1$ is a solution.


8.12. Show that $\sup_{ij} n_0(i, j) = \infty$ is possible in Lemma 2.


8.13. Suppose that $\{\pi_i\}$ solves (8.30), where it is assumed that $\sum_i |\pi_i| < \infty$, so that the
left side is well defined. Show in the irreducible case that the $\pi_i$ are either all
positive or all negative or all 0. Stationary probabilities thus exist in the
irreducible case if and only if (8.30) has a nontrivial solution $\{\pi_i\}$ ($\sum_i \pi_i$
absolutely convergent).


8.14. Show by example that the coupled chain in the proof of Theorem 8.6 need not
be irreducible if the original chain is not aperiodic.


8.15. Suppose that $S$ consists of all the integers and


$p_{0,-1} = p_{0,0} = p_{0,+1} = \frac{1}{3},$
$p_{k,k-1} = q, \quad p_{k,k+1} = p, \quad k \le -1,$
$p_{k,k-1} = p, \quad p_{k,k+1} = q, \quad k \ge 1.$


Show that the chain is irreducible and aperiodic. For which $p$’s is the chain
persistent? For which $p$’s are there stationary probabilities?


8.16. Show that the period of $j$ is the greatest common divisor of the set


(8.52) $[n: n \ge 1, f_{jj}^{(n)} > 0].$


8.17. ↑ Recurrent events. Let $f_1, f_2, \ldots$ be nonnegative numbers with $f = \sum_{n=1}^\infty f_n \le 1$.
Define $u_1, u_2, \ldots$ recursively by $u_1 = f_1$ and


(8.53) $u_n = f_1 u_{n-1} + \cdots + f_{n-1} u_1 + f_n.$


(a) Show that $f < 1$ if and only if $\sum_n u_n < \infty$.
(b) Assume that $f = 1$, set $\mu = \sum_{n=1}^\infty n f_n$, and assume that


(8.54) $\gcd[n: n \ge 1, f_n > 0] = 1.$


Prove the renewal theorem: Under these assumptions, the limit $u = \lim_n u_n$
exists, and $u > 0$ if and only if $\mu < \infty$, in which case $u = 1/\mu$.

Although these definitions and facts are stated in purely analytical terms,
they have a probabilistic interpretation: Imagine an event $\mathcal{E}$ that may occur at
times $1, 2, \ldots$. Suppose $f_n$ is the probability $\mathcal{E}$ occurs first at time $n$. Suppose
further that at each occurrence of $\mathcal{E}$ the system starts anew, so that $f_n$ is the
probability that $\mathcal{E}$ next occurs $n$ steps later. Such an $\mathcal{E}$ is called a recurrent
event. If $u_n$ is the probability that $\mathcal{E}$ occurs at time $n$, then (8.53) holds. The
recurrent event $\mathcal{E}$ is called transient or persistent according as $f < 1$ or $f = 1$, it
is called aperiodic if (8.54) holds, and if $f = 1$, $\mu$ is interpreted as the mean
recurrence time.


8.18. (a) Let $\tau$ be the smallest integer for which $X_\tau = i_0$. Suppose that the state space
is finite and that the $p_{ij}$ are all positive. Find a $\rho$ such that $\max_i (1 - p_{i i_0}) \le \rho < 1$
and hence $P_i[\tau > n] \le \rho^n$ for all $i$.

(b) Apply this to the coupled chain in the proof of Theorem 8.6: $|p_{ik}^{(n)} - p_{jk}^{(n)}| \le \rho^n$.
Now give a new proof of Theorem 8.9.


8.19. A thinker who owns $r$ umbrellas wanders back and forth between home and
office, taking along an umbrella (if there is one at hand) in rain (probability $p$)
but not in shine (probability $q$). Let the state be the number of umbrellas at
hand, irrespective of whether the thinker is at home or at work. Set up the
transition matrix and find the stationary probabilities. Find the steady-state
probability of his getting wet, and show that five umbrellas will protect him at
the 5% level against any climate (any $p$).


8.20. (a) A transition matrix is doubly stochastic if $\sum_i p_{ij} = 1$ for each $j$. For a finite,
irreducible, aperiodic chain with doubly stochastic transition matrix, show that
the stationary probabilities are all equal.
(b) Generalize Example 8.15: Let $S$ be a finite group, let $p(i)$ be probabilities,
and put $p_{ij} = p(j \cdot i^{-1})$, where product and inverse refer to the group operation.
Show that, if all $p(i)$ are positive, the states are all equally likely in the limit.
(c) Let $S$ be the symmetric group on 52 elements. What has (b) to say about
card shuffling?


8.21. A set $C$ in $S$ is closed if $\sum_{j \in C} p_{ij} = 1$ for $i \in C$: once the system enters $C$ it
cannot leave. Show that a chain is irreducible if and only if $S$ has no proper
closed subset.


8.22. ↑ Let $T$ be the set of transient states and define persistent states $i$ and $j$ (if
there are any) to be equivalent if $f_{ij} > 0$. Show that this is an equivalence
relation on $S - T$ and decomposes it into equivalence classes $C_1, C_2, \ldots$, so that
$S = T \cup C_1 \cup C_2 \cup \cdots$. Show that each $C_m$ is closed and that $f_{ij} = 1$ for $i$ and $j$
in the same $C_m$.


8.23. 8.11 8.21↑ Let $T$ be the set of transient states and let $C$ be any closed set of
persistent states. Show that the $P_i$-probabilities of eventual absorption in $C$ for
$i \in T$ are the minimal solution of


(8.55) $y_i = \sum_{j \in T} p_{ij} y_j + \sum_{j \in C} p_{ij}, \quad i \in T; \qquad 0 \le y_i \le 1, \quad i \in T.$


8.24. Suppose that an irreducible chain has period $t > 1$. Show that $S$ decomposes
into sets $S_0, \ldots, S_{t-1}$ such that $p_{ij} > 0$ only if $i \in S_\nu$ and $j \in S_{\nu+1}$ for some $\nu$
($\nu + 1$ reduced modulo $t$). Thus the system passes through the $S_\nu$ in cyclic
succession.


8.25. ↑ Suppose that an irreducible chain of period $t > 1$ has a stationary distribution
$\{\pi_j\}$. Show that, if $i \in S_\nu$ and $j \in S_{\nu+\alpha}$ ($\nu + \alpha$ reduced modulo $t$), then


lim, pif'*® =m, Show that lim, n7'L?,_, pv" = m;/t for all i and j. 


8.26. Eigenvalues. Consider an irreducible, aperiodic chain with state space $\{1, \ldots, s\}$.
Let $r_0 = (\pi_1, \ldots, \pi_s)$ be (Example 8.14) the row vector of stationary probabilities,
and let $c_0$ be the column vector of 1’s; then $r_0$ and $c_0$ are left and right
eigenvectors of $P$ for the eigenvalue $\lambda = 1$.

(a) Suppose that $r$ is a left eigenvector for the (possibly complex) eigenvalue $\lambda$:
$rP = \lambda r$. Prove: If $\lambda = 1$, then $r$ is a scalar multiple of $r_0$ ($\lambda = 1$ has geometric
multiplicity 1). If $\lambda \ne 1$, then $|\lambda| < 1$ and $r c_0 = 0$ (the $1 \times 1$ product of $1 \times s$ and
$s \times 1$ matrices).

(b) Suppose that $c$ is a right eigenvector: $Pc = \lambda c$. If $\lambda = 1$, then $c$ is a scalar
multiple of $c_0$ (again the geometric multiplicity is 1). If $\lambda \ne 1$, then again $|\lambda| < 1$,
and $r_0 c = 0$.


8.27. ↑ Suppose $P$ is diagonalizable; that is, suppose there is a nonsingular $C$ such
that $C^{-1}PC = \Lambda$, where $\Lambda$ is a diagonal matrix. Let $\lambda_1, \ldots, \lambda_s$ be the diagonal
elements of $\Lambda$, let $c_1, \ldots, c_s$ be the successive columns of $C$, let $R = C^{-1}$, and
let $r_1, \ldots, r_s$ be the successive rows of $R$.

(a) Show that $c_u$ and $r_u$ are right and left eigenvectors for the eigenvalue $\lambda_u$,
$u = 1, \ldots, s$. Show that $r_u c_v = \delta_{uv}$. Let $A_u = c_u r_u$ ($s \times s$). Show that $\Lambda^n$ is a diagonal
matrix with diagonal elements $\lambda_1^n, \ldots, \lambda_s^n$ and that $P^n = C\Lambda^n R = \sum_{u=1}^{s} \lambda_u^n A_u$, $n \ge 1$.
(b) Part (a) goes through under the sole assumption that $P$ is a diagonalizable
matrix. Now assume also that it is an irreducible, aperiodic stochastic matrix,
and arrange the notation so that $\lambda_1 = 1$. Show that each row of $A_1$ is the vector
$(\pi_1, \ldots, \pi_s)$ of stationary probabilities. Since


(8.56) $P^n = A_1 + \sum_{u=2}^{s} \lambda_u^n A_u$


and $|\lambda_u| < 1$ for $2 \le u \le s$, this proves exponential convergence once more.
(c) Write out (8.56) explicitly for the case $s = 2$.
(d) Find an irreducible, aperiodic stochastic matrix that is not diagonalizable.


8.28. ↑ (a) Show that the eigenvalue $\lambda = 1$ has geometric multiplicity 1 if there is
only one closed, irreducible set of states; there may be transient states, in which
case the chain itself is not irreducible.

(b) Show, on the other hand, that if there is more than one closed, irreducible
set of states, then $\lambda = 1$ has geometric multiplicity exceeding 1.

(c) Suppose that there is only one closed, irreducible set of states. Show that
the chain has period exceeding 1 if and only if there is an eigenvalue other than
1 on the unit circle.


8.29. Suppose that $\{X_n\}$ is a Markov chain with state space $S$, and put $Y_n = (X_n, X_{n+1})$.
Let $T$ be the set of pairs $(i, j)$ such that $p_{ij} > 0$ and show that $\{Y_n\}$ is a Markov
chain with state space $T$. Write down the transition probabilities. Show that, if
$\{X_n\}$ is irreducible and aperiodic, so is $\{Y_n\}$. Show that, if $\pi_i$ are stationary
probabilities for $\{X_n\}$, then $\pi_i p_{ij}$ are stationary probabilities for $\{Y_n\}$.


8.30. 6.10 8.29↑ Suppose that the chain is finite, irreducible, and aperiodic and
that the initial probabilities are the stationary ones. Fix a state $i$, let $A_n = [X_n = i]$,
and let $N_n$ be the number of passages through $i$ in the first $n$ steps. Calculate
$\alpha_n$ and $\beta_n$ as defined by (5.41). Show that $\beta_n - \alpha_n^2 = O(1/n)$, so that $n^{-1}N_n \to \pi_i$
with probability 1. Show for a function $f$ on the state space that $n^{-1}\sum_{k=1}^{n} f(X_k) \to \sum_i \pi_i f(i)$
with probability 1. Show that $n^{-1}\sum_{k=1}^{n} g(X_k, X_{k+1}) \to \sum_{ij} \pi_i p_{ij} g(i, j)$
for functions $g$ on $S \times S$.


8.31. 6.14 8.30↑ If $X_0(\omega) = i_0, \ldots, X_n(\omega) = i_n$ for states $i_0, \ldots, i_n$, put
$p_n(\omega) = \pi_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$, so that $p_n(\omega)$ is the probability of the observation observed.
Show that $-n^{-1} \log p_n(\omega) \to h = -\sum_{ij} \pi_i p_{ij} \log p_{ij}$ with probability 1 if the
chain is finite, irreducible, and aperiodic. Extend to this case the notions of
source, entropy, and asymptotic equipartition.


8.32. A sequence $\{X_n\}$ is a Markov chain of second order if
$P[X_{n+1} = j | X_0 = i_0, \ldots, X_n = i_n] = P[X_{n+1} = j | X_{n-1} = i_{n-1}, X_n = i_n] = p_{i_{n-1} i_n : j}$.
Show that nothing really new is involved because the sequence of pairs $(X_n, X_{n+1})$ is an
ordinary Markov chain (of first order). Compare Problem 8.29. Generalize this
idea into chains of order $r$.


8.33. Consider a chain on $S = \{0, 1, \ldots, r\}$, where 0 and $r$ are absorbing states and
$p_{i,i+1} = p_i > 0$, $p_{i,i-1} = q_i = 1 - p_i > 0$ for $0 < i < r$. Identify state $i$ with a point
$z_i$ on the line, where $0 = z_0 < \cdots < z_r$ and the distance from $z_i$ to $z_{i+1}$ is $q_i/p_i$
times that from $z_{i-1}$ to $z_i$. Given a function $\varphi$ on $S$, consider the associated
function $\bar\varphi$ on $[0, z_r]$ defined at the $z_i$ by $\bar\varphi(z_i) = \varphi(i)$ and in between by linear
interpolation. Show that $\varphi$ is excessive if and only if $\bar\varphi$ is concave. Show that
the probability of absorption in $r$ for initial state $i$ is $t_{i-1}/t_{r-1}$, where
$t_i = \sum_{k=0}^{i} q_1 \cdots q_k / p_1 \cdots p_k$. Deduce (7.7). Show that in the new scale the expected
distance moved on each step is 0.


8.34. Suppose that a finite chain is irreducible and aperiodic. Show by Theorem 8.9
that an excessive function must be constant.


8.35. A zero-one law. Let the state space $S$ contain $s$ points, and suppose that
$\epsilon_n = \sup_{ij} |p_{ij}^{(n)} - \pi_j| \to 0$, as holds under the hypotheses of Theorem 8.9. For
$a \le b$, let $\mathcal{G}_a^b$ be the $\sigma$-field generated by the sets $[X_a = u_a, \ldots, X_b = u_b]$.
Let $\mathcal{G}_a^\infty = \sigma(\bigcup_{b=a}^\infty \mathcal{G}_a^b)$ and $\mathcal{T} = \bigcap_{a=1}^\infty \mathcal{G}_a^\infty$. Show that
$|P(A \cap B) - P(A)P(B)| \le s(\epsilon_n + \epsilon_{b+n})$ for $A \in \mathcal{G}_0^b$ and $B \in \mathcal{G}_{b+n}^{b+m}$; the $\epsilon_{b+n}$ can be suppressed if the
initial probabilities are the stationary ones. Show that this holds for $A \in \mathcal{G}_0^b$
and $B \in \mathcal{G}_{b+n}^\infty$. Show that $C \in \mathcal{T}$ implies that $P(C)$ is either 0 or 1.


8.36.† Alter the chain in Example 8.13 so that $q_0 = 1 - p_0 = 1$ (the other $p_i$ and $q_i$ still
positive). Let $\beta = \lim_n p_1 \cdots p_n$ and assume that $\beta > 0$. Define a payoff function
by $f(0) = 1$ and $f(i) = 1 - f_{i0}$ for $i > 0$. If $X_0, \ldots, X_n$ are positive, put
$\sigma_n = n$; otherwise let $\sigma_n$ be the smallest $k$ such that $X_k = 0$. Show that
$E_i[f(X_{\sigma_n})] \to 1$ as $n \to \infty$, so that $v(i) \equiv 1$. Thus the support set is $M = \{0\}$, and
for an initial state $i > 0$ the probability of ever hitting $M$ is $f_{i0} < 1$.

For an arbitrary finite stopping time $\tau$, choose $n$ so that $P_i[\tau < n = \sigma_n] > 0$.
Then $E_i[f(X_\tau)] \le 1 - f_{i+n,0} P_i[\tau < n = \sigma_n] < 1$. Thus no strategy achieves the
value $v(i)$ (except of course for $i = 0$).


8.37. ↑ Let the chain be as in the preceding problem, but assume that $\beta = 0$, so
that $f_{i0} = 1$ for all $i$. Suppose that $\lambda_1, \lambda_2, \ldots$ exceed 1 and that $\lambda_1 \cdots \lambda_n \to \lambda < \infty$;
put $f(0) = 0$ and $f(i) = \lambda_1 \cdots \lambda_{i-1} / p_1 \cdots p_{i-1}$. For an arbitrary (finite)
stopping time $\tau$, the event $[\tau = n]$ must have the form $[(X_0, \ldots, X_n) \in I_n]$ for
some set $I_n$ of $(n + 1)$-long sequences of states. Show that for each $i$ there is at
most one $n \ge 0$ such that $(i, i + 1, \ldots, i + n) \in I_n$. If there is no such $n$, then
$E_i[f(X_\tau)] = 0$. If there is one, then


$E_i[f(X_\tau)] = P_i[(X_0, \ldots, X_n) = (i, \ldots, i + n)] f(i + n),$


and hence the only possible values of $E_i[f(X_\tau)]$ are


$0, \quad f(i), \quad p_i f(i + 1) = f(i)\lambda_i, \quad p_i p_{i+1} f(i + 2) = f(i)\lambda_i \lambda_{i+1}, \quad \ldots.$


Thus $v(i) = f(i)\lambda / \lambda_1 \cdots \lambda_{i-1}$ for $i \ge 1$; no strategy achieves this value. The support set
is $M = \{0\}$, and the hitting time $\tau_0$ for $M$ is finite, but $E_i[f(X_{\tau_0})] = 0$.

†The final three problems in this section involve expected values for random variables with
infinite range.


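8.38. 5.12↑ Consider an irreducible, aperiodic, positive persistent chain. Let $\tau_j$ be
the smallest $n$ such that $X_n = j$, and let $m_{ij} = E_i[\tau_j]$. Show that there is an $r$
such that $\rho = P_j[X_1 \ne j, \ldots, X_{r-1} \ne j, X_r = i]$ is positive; from $f_{jj}^{(n+r)} \ge \rho f_{ij}^{(n)}$ and
$m_{jj} < \infty$, conclude that $m_{ij} < \infty$ and $m_{ij} = \sum_{n=0}^\infty P_i[\tau_j > n]$. Starting from
$p_{ij}^{(t)} = \sum_{s=1}^{t} f_{ij}^{(s)} p_{jj}^{(t-s)}$, show that


$\sum_{t=1}^{n} (p_{ij}^{(t)} - p_{jj}^{(t)}) = 1 - \sum_{m=0}^{n} p_{jj}^{(n-m)} P_i[\tau_j > m].$


Use the M-test to show that


$\pi_j m_{ij} = 1 + \sum_{n=1}^\infty (p_{jj}^{(n)} - p_{ij}^{(n)}).$


If $i = j$, this gives $m_{jj} = 1/\pi_j$ again; if $i \ne j$, it shows how in principle $m_{ij}$ can be
calculated from the transition matrix and the stationary probabilities.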


SECTION 9. LARGE DEVIATIONS AND THE LAW 
OF THE ITERATED LOGARITHM* 


It is interesting in connection with the strong law of large numbers to
estimate the rate at which $S_n/n$ converges to the mean $m$. The proof of the
strong law used upper bounds for the probabilities P[|S,, — m| > a] for large 
a. Accurate upper and lower bounds for these probabilities will lead to 
the law of the iterated logarithm, a theorem giving very precise rates for
$S_n/n \to m$.

The first concern will be to estimate the probability of large deviations 
from the mean, which will require the method of moment generating func- 
tions. The estimates will be applied first to a problem in statistics and then to 
the law of the iterated logarithm. 


*This section may be omitted.

Moment Generating Functions


Let $X$ be a simple random variable assuming the distinct values $x_1, \ldots, x_l$
with respective probabilities $p_1, \ldots, p_l$. Its moment generating function is


(9.1) $M(t) = E[e^{tX}] = \sum_{i=1}^{l} p_i e^{t x_i}.$


(See (5.19) for expected values of functions of random variables.) This
function, defined for all real $t$, can be regarded as associated with $X$ itself or
as associated with its distribution, that is, with the measure on the line
having mass $p_i$ at $x_i$ (see (5.12)).

If $c = \max_i |x_i|$, the partial sums of the series $e^{tX} = \sum_{k=0}^\infty t^k X^k / k!$ are
bounded by $e^{|t|c}$, and so the corollary to Theorem 5.4 applies:


(9.2) $M(t) = \sum_{k=0}^\infty \frac{t^k}{k!} E[X^k].$


Thus $M(t)$ has a Taylor expansion, and as follows from the general theory
[A29], the coefficient of $t^k$ must be $M^{(k)}(0)/k!$ Thus


(9.3) $E[X^k] = M^{(k)}(0).$


Furthermore, term-by-term differentiation in (9.1) gives


$M^{(k)}(t) = \sum_{i=1}^{l} p_i x_i^k e^{t x_i} = E[X^k e^{tX}];$


taking $t = 0$ here gives (9.3) again. Thus the moments of $X$ can be calculated
by successive differentiation, whence $M(t)$ gets its name. Note that $M(0) = 1$.


Example 9.1. If $X$ assumes the values 1 and 0 with probabilities $p$ and
$q = 1 - p$, as in Bernoulli trials, its moment generating function is
$M(t) = pe^t + q$. The first two moments are $M'(0) = p$ and $M''(0) = p$, and the
variance is $p - p^2 = pq$. ■


If $X_1, \ldots, X_n$ are independent, then for each $t$ (see the argument following
(5.10)), $e^{tX_1}, \ldots, e^{tX_n}$ are also independent. Let $M$ and $M_1, \ldots, M_n$ be the
respective moment generating functions of $S = X_1 + \cdots + X_n$ and of
$X_1, \ldots, X_n$; of course, $e^{tS} = \prod_i e^{tX_i}$. Since by (5.25) expected values multiply
for independent random variables, there results the fundamental relation


(9.4) $M(t) = M_1(t) \cdots M_n(t).$

This is an effective way of calculating the moment generating function of
the sum $S$. The real interest, however, centers on the distribution of $S$, and
so it is important to know that distributions can in principle be recovered
from their moment generating functions.

Consider along with (9.1) another finite exponential sum $N(t) = \sum_j q_j e^{t y_j}$,
and suppose that $M(t) = N(t)$ for all $t$. If $x_{i_0} = \max x_i$ and $y_{j_0} = \max y_j$, then
$M(t) \sim p_{i_0} e^{t x_{i_0}}$ and $N(t) \sim q_{j_0} e^{t y_{j_0}}$ as $t \to \infty$, and so $x_{i_0} = y_{j_0}$ and $p_{i_0} = q_{j_0}$. The
same argument now applies to $\sum_{i \ne i_0} p_i e^{t x_i} = \sum_{j \ne j_0} q_j e^{t y_j}$, and it follows inductively
that with appropriate relabeling, $x_i = y_i$ and $p_i = q_i$ for each $i$. Thus
the function (9.1) does uniquely determine the $x_i$ and $p_i$.

Example 9.2. If $X_1, \ldots, X_n$ are independent, each assuming values 1 and
0 with probabilities $p$ and $q$, then $S = X_1 + \cdots + X_n$ is the number of
successes in $n$ Bernoulli trials. By (9.4) and Example 9.1, $S$ has the moment
generating function


$E[e^{tS}] = (pe^t + q)^n = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} e^{tk}.$


The right-hand form shows this to be the moment generating function of a
distribution with mass $\binom{n}{k} p^k q^{n-k}$ at the integer $k$, $0 \le k \le n$. The uniqueness
just established therefore yields the standard fact that $P[S = k] = \binom{n}{k} p^k q^{n-k}$. ■
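The two forms of the moment generating function in Example 9.2 can be compared numerically. This is only a sanity check under assumed values of $n$, $p$, and $t$; the function names `mgf_product` and `mgf_masses` are hypothetical.

```python
import math


def mgf_product(t, n, p):
    """(p e^t + q)^n, the product form given by (9.4) and Example 9.1."""
    return (p * math.exp(t) + (1 - p)) ** n


def mgf_masses(t, n, p):
    """Sum over k of C(n, k) p^k q^(n-k) e^(t k), the binomial-mass form."""
    q = 1 - p
    return sum(
        math.comb(n, k) * p**k * q ** (n - k) * math.exp(t * k)
        for k in range(n + 1)
    )


for t in (-1.0, 0.0, 0.5, 2.0):
    print(t, mgf_product(t, 10, 0.3), mgf_masses(t, 10, 0.3))
```

Agreement at a handful of points is of course weaker than the uniqueness argument in the text, which needs equality for all $t$.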


The cumulant generating function of $X$ (or of its distribution) is


(9.5) $C(t) = \log M(t) = \log E[e^{tX}].$


(Note that $M(t)$ is strictly positive.) Since $C' = M'/M$ and
$C'' = (MM'' - (M')^2)/M^2$, and since $M(0) = 1$,


(9.6) $C(0) = 0, \quad C'(0) = E[X], \quad C''(0) = \mathrm{Var}[X].$


Let $m_k = E[X^k]$. The leading term in (9.2) is $m_0 = 1$, and so a formal
expansion of the logarithm in (9.5) gives


(9.7) $C(t) = \sum_{v=1}^\infty \frac{(-1)^{v+1}}{v} \left( \sum_{k=1}^\infty \frac{m_k}{k!} t^k \right)^v.$


Since $M(t) \to 1$ as $t \to 0$, this expression is valid for $t$ in some neighborhood
of 0. By the theory of series, the powers on the right can be expanded and
terms with a common factor $t^i$ collected together. This gives an expansion


(9.8) $C(t) = \sum_{i=1}^\infty \frac{c_i}{i!} t^i,$


valid in some neighborhood of 0.

The $c_i$ are the cumulants of $X$. Equating coefficients in the expansions
(9.7) and (9.8) leads to $c_1 = m_1$ and $c_2 = m_2 - m_1^2$, which checks with (9.6).
Each $c_i$ can be expressed as a polynomial in $m_1, \ldots, m_i$ and conversely,
although the calculations soon become tedious. If $E[X] = 0$, however, so that
$m_1 = c_1 = 0$, it is not hard to check that


(9.9) $c_3 = m_3, \quad c_4 = m_4 - 3m_2^2.$


Taking logarithms converts the multiplicative relation (9.4) into the additive
relation


(9.10) $C(t) = C_1(t) + \cdots + C_n(t)$


for the corresponding cumulant generating functions; it is valid in the
presence of independence. By this and the definition (9.8), it follows that
cumulants add for independent random variables.
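The relations among moments and cumulants can be checked exactly on small distributions. The sketch below uses the standard recursion $m_n = \sum_{k=1}^{n} \binom{n-1}{k-1} c_k m_{n-k}$, which follows from $M' = C'M$ but is not stated in the text, so it should be read as an assumption of the example; the distributions and function names are likewise made up for illustration.

```python
from fractions import Fraction
from math import comb


def moments(dist, n):
    """Raw moments m_0..m_n of a finite distribution given as {value: prob}."""
    return [sum(p * Fraction(x) ** k for x, p in dist.items()) for k in range(n + 1)]


def cumulants(m):
    """Cumulants c_0..c_n (with c_0 = 0) from raw moments m_0..m_n."""
    n = len(m) - 1
    c = [Fraction(0)] * (n + 1)
    for r in range(1, n + 1):
        # Solve m_r = sum_{k=1}^{r} C(r-1, k-1) c_k m_{r-k} for c_r (m_0 = 1).
        c[r] = m[r] - sum(comb(r - 1, k - 1) * c[k] * m[r - k] for k in range(1, r))
    return c


def convolve(d1, d2):
    """Distribution of the sum of two independent simple random variables."""
    out = {}
    for x, p in d1.items():
        for y, q in d2.items():
            out[x + y] = out.get(x + y, Fraction(0)) + p * q
    return out


# A hypothetical mean-zero distribution and a Bernoulli(1/3) variable.
X = {-2: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}
B = {0: Fraction(2, 3), 1: Fraction(1, 3)}

mX = moments(X, 4)
cX = cumulants(mX)
cB = cumulants(moments(B, 4))
cSum = cumulants(moments(convolve(X, B), 4))
print(cX[1:], cB[1:], cSum[1:])
```

Because the arithmetic is rational, the identities $c_3 = m_3$ and $c_4 = m_4 - 3m_2^2$ for the centered variable, and the additivity of cumulants for the independent sum, hold with exact equality and not merely to rounding error.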

Clearly, $M''(t) = E[X^2 e^{tX}] \ge 0$. Since
$(M'(t))^2 = E^2[X e^{tX}] \le E[e^{tX}] \cdot E[X^2 e^{tX}] = M(t)M''(t)$
by Schwarz’s inequality (5.36), $C''(t) \ge 0$. Thus the
moment generating function and the cumulant generating function are both
convex.


Large Deviations 


Let $Y$ be a simple random variable assuming values $y_j$ with probabilities $p_j$.
The problem is to estimate $P[Y \ge \alpha]$ when $Y$ has mean 0 and $\alpha$ is positive. It
is notationally convenient to subtract $\alpha$ away from $Y$ and instead estimate
$P[Y \ge 0]$ when $Y$ has negative mean.

Assume then that


(9.11) $E[Y] < 0, \quad P[Y > 0] > 0,$


the second assumption to avoid trivialities. Let $M(t) = \sum_j p_j e^{t y_j}$ be the
moment generating function of $Y$. Then $M'(0) < 0$ by the first assumption in
(9.11), and $M(t) \to \infty$ as $t \to \infty$ by the second. Since $M(t)$ is convex, it has its
minimum $\rho$ at a positive argument $\tau$:


(9.12) $\inf_t M(t) = M(\tau) = \rho, \quad 0 < \rho < 1, \quad \tau > 0.$


Construct (on an entirely irrelevant probability space) an auxiliary random
variable $Z$ such that


(9.13) $P[Z = y_j] = \frac{e^{\tau y_j}}{\rho} P[Y = y_j]$


for each $y_j$ in the range of $Y$. Note that the probabilities on the right do add
to 1. The moment generating function of $Z$ is


(9.14) $E[e^{tZ}] = \sum_j \frac{e^{\tau y_j}}{\rho} p_j e^{t y_j} = \frac{M(\tau + t)}{\rho},$


and therefore


(9.15) $E[Z] = \frac{M'(\tau)}{\rho} = 0, \quad s^2 = E[Z^2] = \frac{M''(\tau)}{\rho} > 0.$


For all positive $t$, $P[Y \ge 0] = P[e^{tY} \ge 1] \le M(t)$ by Markov’s inequality
(5.31), and hence


(9.16) $P[Y \ge 0] \le \rho.$

Inequalities in the other direction are harder to obtain. If $\sum'$ denotes
summation over those indices $j$ for which $y_j \ge 0$, then


(9.17) $P[Y \ge 0] = \sum' p_j = \rho \sum' e^{-\tau y_j} P[Z = y_j].$


Put the final sum here in the form $e^{-\theta}$, and let $p = P[Z \ge 0]$. By (9.16), $\theta \ge 0$.
Since $\log x$ is concave, Jensen’s inequality (5.33) gives


$-\theta = \log \sum' e^{-\tau y_j} p^{-1} P[Z = y_j] + \log p$
$\ge \sum' (-\tau y_j) p^{-1} P[Z = y_j] + \log p$
$= -\tau s p^{-1} \sum' \frac{y_j}{s} P[Z = y_j] + \log p.$


By (9.15) and Lyapounov’s inequality (5.37),


$\sum' \frac{y_j}{s} P[Z = y_j] \le \frac{1}{s} E[|Z|] \le \frac{1}{s} E^{1/2}[Z^2] = 1.$


The last two inequalities give


(9.18) $0 \le \theta \le \frac{\tau s}{P[Z \ge 0]} - \log P[Z \ge 0].$


This proves the following result.

Theorem 9.1. Suppose that $Y$ satisfies (9.11). Define $\rho$ and $\tau$ by (9.12), let
$Z$ be a random variable with distribution (9.13), and define $s^2$ by (9.15). Then
$P[Y \ge 0] = \rho e^{-\theta}$, where $\theta$ satisfies (9.18).
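To see the quantities of Theorem 9.1 on a concrete case, the sketch below computes $\tau$, $\rho$, the tilted distribution of $Z$, and $\theta$ for a two-point $Y$, then compares $\theta$ with the bound (9.18). The distribution, the function names, and the use of bisection on $M'$ to locate the minimizer are assumptions of this example.

```python
import math


def mgf(dist, t):
    """M(t) for a finite distribution given as {value: prob}."""
    return sum(p * math.exp(t * y) for y, p in dist.items())


def tilt(dist):
    """Return (rho, tau, Z): the minimum of M, its location, and the law (9.13)."""
    def d1(t):
        # M'(t), which is increasing because M is convex.
        return sum(p * y * math.exp(t * y) for y, p in dist.items())

    lo, hi = 0.0, 1.0
    while d1(hi) <= 0:  # expand until the derivative changes sign
        hi *= 2
    for _ in range(200):  # bisection on M'(t) = 0
        mid = (lo + hi) / 2
        if d1(mid) < 0:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2
    rho = mgf(dist, tau)
    Z = {y: math.exp(tau * y) * p / rho for y, p in dist.items()}
    return rho, tau, Z


# Hypothetical Y with E[Y] = -0.2 < 0 and P[Y > 0] = 0.4 > 0, as (9.11) requires.
Y = {-1: 0.6, 1: 0.4}
rho, tau, Z = tilt(Y)

tail = sum(p for y, p in Y.items() if y >= 0)        # P[Y >= 0]
theta = -math.log(tail / rho)                        # from P[Y >= 0] = rho e^{-theta}
p_Z = sum(p for z, p in Z.items() if z >= 0)         # P[Z >= 0]
s = math.sqrt(sum(p * z * z for z, p in Z.items()))  # E[Z] = 0, so s^2 = E[Z^2]
bound = tau * s / p_Z - math.log(p_Z)                # right side of (9.18)
print(rho, tau, theta, bound)
```

For this $Y$ the minimizer has the closed form $\tau = \frac{1}{2}\log 1.5$, which gives a check on the bisection.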


To use (9.18) requires a lower bound for $P[Z \ge 0]$.


Theorem 9.2. If $E[Z] = 0$, $E[Z^2] = s^2$, and $E[Z^4] = \xi^4 > 0$, then
$P[Z \ge 0] \ge s^4 / 4\xi^4$.†


PROOF. Let $Z^+ = Z I_{[Z \ge 0]}$ and $Z^- = -Z I_{[Z < 0]}$. Then $Z^+$ and $Z^-$ are
nonnegative, $Z = Z^+ - Z^-$, $Z^2 = (Z^+)^2 + (Z^-)^2$, and


(9.19) $s^2 = E[(Z^+)^2] + E[(Z^-)^2].$


Let $p = P[Z \ge 0]$. By Schwarz’s inequality (5.36),


$E[(Z^+)^2] = E[I_{[Z \ge 0]} Z^2] \le E^{1/2}[I_{[Z \ge 0]}^2] E^{1/2}[Z^4] = p^{1/2} \xi^2.$


By Hölder’s inequality (5.35) (for $p = \frac{3}{2}$ and $q = 3$)


$E[(Z^-)^2] = E[(Z^-)^{2/3} (Z^-)^{4/3}] \le E^{2/3}[Z^-] E^{1/3}[(Z^-)^4] \le E^{2/3}[Z^-] \xi^{4/3}.$


Since $E[Z] = 0$, another application of Hölder’s inequality (for $p = 4$ and
$q = \frac{4}{3}$) gives


$E[Z^-] = E[Z^+] = E[Z I_{[Z \ge 0]}] \le E^{1/4}[Z^4] E^{3/4}[I_{[Z \ge 0]}^{4/3}] = \xi p^{3/4}.$


Combining these three inequalities with (9.19) gives
$s^2 \le p^{1/2}\xi^2 + (\xi p^{3/4})^{2/3} \xi^{4/3} = 2p^{1/2}\xi^2$. ■


†For a related result, see Problem 25.19.
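The bound of Theorem 9.2 can be evaluated exactly on small mean-zero distributions. The two distributions below are hypothetical, chosen so that one is symmetric and one is heavily skewed; `lower_bound_check` is an assumed helper name.

```python
from fractions import Fraction


def lower_bound_check(dist):
    """Return (P[Z >= 0], s^4 / (4 xi^4)) for a mean-zero {value: prob} law."""
    mean = sum(p * z for z, p in dist.items())
    if mean != 0:
        raise ValueError("Theorem 9.2 is stated for E[Z] = 0")
    s2 = sum(p * z**2 for z, p in dist.items())   # s^2 = E[Z^2]
    xi4 = sum(p * z**4 for z, p in dist.items())  # xi^4 = E[Z^4]
    p_nonneg = sum(p for z, p in dist.items() if z >= 0)
    return p_nonneg, s2**2 / (4 * xi4)


symmetric = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
skewed = {9: Fraction(1, 10), -1: Fraction(9, 10)}

for name, dist in (("symmetric", symmetric), ("skewed", skewed)):
    prob, bound = lower_bound_check(dist)
    print(name, prob, bound)
```

In the skewed case the positive mass is small but so is the bound, since the large fourth moment pushes $s^4/4\xi^4$ down.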


x F 1 
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Chernoff’s Theorem‘ 


Theorem 9.3. Let $X_1, X_2, \ldots$ be independent, identically distributed simple
random variables satisfying $E[X_n] < 0$ and $P[X_n > 0] > 0$, let $M(t)$ be their
common moment generating function, and put $\rho = \inf_t M(t)$. Then


(9.20)    $\lim_n \dfrac{1}{n} \log P[X_1 + \cdots + X_n \ge 0] = \log \rho.$


PROOF. Put $Y_n = X_1 + \cdots + X_n$. Then $E[Y_n] < 0$ and $P[Y_n > 0] \ge
P^n[X_1 > 0] > 0$, and so the hypotheses of Theorem 9.1 are satisfied. Define
$\rho_n$ and $\tau_n$ by $\inf_t M_n(t) = M_n(\tau_n) = \rho_n$, where $M_n(t)$ is the moment generating
function of $Y_n$. Since $M_n(t) = M^n(t)$, it follows that $\rho_n = \rho^n$ and $\tau_n = \tau$, where
$M(\tau) = \rho$.

Let $Z_n$ be the analogue for $Y_n$ of the $Z$ described by (9.13). Its moment
generating function (see (9.14)) is $M_n(\tau + t)/\rho^n = (M(\tau + t)/\rho)^n$. This is also
the moment generating function of $V_1 + \cdots + V_n$ for independent random
variables $V_1, \ldots, V_n$, each having moment generating function $M(\tau + t)/\rho$.
Now each $V_i$ has (see (9.15)) mean 0 and some positive variance $\sigma^2$ and
fourth moment $\xi^4$ independent of $i$. Since $Z_n$ must have the same moments
as $V_1 + \cdots + V_n$, it has mean 0, variance $s_n^2 = n\sigma^2$, and fourth moment
$\xi_n^4 = n\xi^4 + 3n(n - 1)\sigma^4 = O(n^2)$ (see (6.2)). By Theorem 9.2, $P[Z_n \ge 0] \ge
s_n^4/4\xi_n^4 \ge \alpha$ for some positive $\alpha$ independent of $n$. By Theorem 9.1 then,
$P[Y_n \ge 0] = \rho^n e^{-\theta_n}$, where $0 \le \theta_n \le \tau_n s_n \alpha^{-1} - \log \alpha = \tau \alpha^{-1} \sigma \sqrt{n} - \log \alpha$. This
gives (9.20) and shows, in fact, that the rate of convergence is $O(n^{-1/2})$. ∎


This result is important in the theory of statistical hypothesis testing. An informal 
treatment of the Bernoulli case will illustrate the connection. 

Suppose $S_n = X_1 + \cdots + X_n$, where the $X_i$ are independent and assume the values
1 and 0 with probabilities $p$ and $q$. Now $P[S_n \ge na] = P[\sum_{k=1}^{n} (X_k - a) \ge 0]$, and
Chernoff's theorem applies if $p < a < 1$. In this case $M(t) = E[e^{t(X_1 - a)}] = e^{-ta}(pe^t + q)$.
Minimizing this shows that the $\rho$ of Chernoff's theorem satisfies


$-\log \rho = K(a, p) = a \log \dfrac{a}{p} + b \log \dfrac{b}{q},$


where $b = 1 - a$. By (9.20), $n^{-1} \log P[S_n \ge na] \to -K(a, p)$; express this as


(9.21)    $P[S_n \ge na] \approx e^{-nK(a,p)}.$
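
The limit behind (9.21) can be observed directly in the Bernoulli case, where the tail probability is an explicit binomial sum. The following Python sketch is illustrative only: the values of `p` and `a`, the sample sizes, and the function names are assumptions made for the example. It compares $-n^{-1} \log P[S_n \ge na]$, computed exactly in log space, with $K(a, p)$.

```python
import math

def K(a, p):
    """Rate function K(a, p) = a log(a/p) + b log(b/q), with b = 1 - a and q = 1 - p."""
    b, q = 1 - a, 1 - p
    return a * math.log(a / p) + b * math.log(b / q)

def log_upper_tail(n, p, a):
    """log P[S_n >= n a] for S_n binomial(n, p), summed in log space for stability."""
    k0 = math.ceil(n * a)
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k0, n + 1)
    ]
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))

p, a = 0.3, 0.5                      # illustrative values with p < a < 1
rate = K(a, p)
exact_rate = {n: -log_upper_tail(n, p, a) / n for n in (100, 1000, 10000)}

for n, r in exact_rate.items():
    print(n, round(r, 5), round(rate, 5))
```

The exact rates stay above $K(a, p)$, as Markov's inequality requires, and approach it slowly as $n$ grows, which is the sense in which (9.21) is only an approximation on the logarithmic scale.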


Suppose now that $p$ is unknown and that there are two competing hypotheses
concerning its value: the hypothesis $H_1$ that $p = p_1$ and the hypothesis $H_2$ that
$p = p_2$, where $p_1 < p_2$. Given the observed results $X_1, \ldots, X_n$ of $n$ Bernoulli trials,
one decides in favor of $H_2$ if $S_n \ge na$ and in favor of $H_1$ if $S_n < na$, where $a$ is some
number satisfying $p_1 < a < p_2$. The problem is to find an advantageous value for the
threshold $a$.

By (9.21), 


(9.22)    $P[S_n \ge na \mid H_1] \approx e^{-nK(a,p_1)},$


where the notation indicates that the probability is calculated for $p = p_1$, that is,
under the assumption of $H_1$. By symmetry,


(9.23)    $P[S_n < na \mid H_2] \approx e^{-nK(a,p_2)}.$


The left sides of (9.22) and (9.23) are the probabilities of erroneously deciding in favor
of $H_2$ when $H_1$ is, in fact, true and of erroneously deciding in favor of $H_1$ when $H_2$
is, in fact, true; these are the probabilities describing the level and power of the test.

Suppose $a$ is chosen so that $K(a, p_1) = K(a, p_2)$, which makes the two error
probabilities approximately equal. This constraint gives for $a$ a linear equation with
solution


(9.24)    $a = a(p_1, p_2) = \dfrac{\log(q_1/q_2)}{\log(p_2/p_1) + \log(q_1/q_2)},$


where $q_i = 1 - p_i$. The common error probability is approximately $e^{-nK(a,p_1)}$ for this
value of $a$, and so the larger $K(a, p_1)$ is, the easier it is to distinguish statistically
between $p_1$ and $p_2$.

Although $K(a(p_1, p_2), p_1)$ is a complicated function, it has a simple approximation
for $p_1$ near $p_2$. As $x \to 0$, $\log(1 + x) = x - \frac{1}{2}x^2 + O(x^3)$. Using this in the definition of
$K$ and collecting terms gives


(9.25)    $K(p + x, p) = \dfrac{x^2}{2pq} + O(x^3), \qquad x \to 0.$


Fix $p_1 = p$, and let $p_2 = p + t$; (9.24) becomes a function $\psi(t)$ of $t$, and expanding the
logarithms gives


(9.26)    $\psi(t) = p + \frac{1}{2}t + O(t^2), \qquad t \to 0,$


after some reductions. Finally, (9.25) and (9.26) together imply that


(9.27)    $K(\psi(t), p) = \dfrac{t^2}{8pq} + O(t^3), \qquad t \to 0.$
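
The threshold formula (9.24) and the expansions (9.26) and (9.27) are easy to check numerically. The Python sketch below is illustrative only: the values of `p` and `t` and the function names are assumptions chosen for the example. It computes $a = \psi(t)$, confirms that the two rates $K(a, p_1)$ and $K(a, p_2)$ coincide, and compares the common rate with the leading term $t^2/8pq$.

```python
import math

def K(a, p):
    """Rate function K(a, p) = a log(a/p) + b log(b/q), with b = 1 - a and q = 1 - p."""
    b, q = 1 - a, 1 - p
    return a * math.log(a / p) + b * math.log(b / q)

def threshold(p1, p2):
    """The solution a(p1, p2) of K(a, p1) = K(a, p2), as in (9.24)."""
    q1, q2 = 1 - p1, 1 - p2
    num = math.log(q1 / q2)
    return num / (math.log(p2 / p1) + num)

p, t = 0.1, 0.01                      # illustrative values, with t small
a = threshold(p, p + t)               # psi(t) in the notation of (9.26)
k1, k2 = K(a, p), K(a, p + t)         # equal up to rounding error
approx = t * t / (8 * p * (1 - p))    # leading term of (9.27)

print(a, p + t / 2)                   # psi(t) against p + t/2
print(k1, k2, approx)
```

For these values the agreement with $t^2/8pq$ is within a few percent; the remaining gap is the $O(t^3)$ term.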

In distinguishing $p_1 = p$ from $p_2 = p + t$ for small $t$, if $a$ is chosen to equalize the
two error probabilities, then their common value is about $e^{-nt^2/8pq}$. For $t$ fixed, the
nearer $p$ is to $\frac{1}{2}$ the larger this probability is and the more difficult it is to distinguish
$p$ from $p + t$. As an example, compare $p = .1$ with $p = .5$. Now $.36nt^2/8(.1)(.9) =
nt^2/8(.5)(.5)$. With a sample only 36 percent as large, .1 can therefore be distinguished
from $.1 + t$ with about the same precision as .5 can be distinguished from
$.5 + t$.

The Law of the Iterated Logarithm


The analysis of the rate at which $S_n/n$ approaches the mean depends on the
following variant of the theorem on large deviations.


Theorem 9.4. Let $S_n = X_1 + \cdots + X_n$, where the $X_n$ are independent and
identically distributed simple random variables with mean 0 and variance 1. If
$a_n$ are constants satisfying


(9.28)    $a_n \to \infty, \qquad \dfrac{a_n}{\sqrt{n}} \to 0,$

then

(9.29)    $P[S_n \ge a_n \sqrt{n}] = e^{-a_n^2 (1 + \zeta_n)/2}$


for a sequence $\zeta_n$ going to 0.


PROOF. Put $Y_n = S_n - a_n \sqrt{n} = \sum_{k=1}^{n} (X_k - a_n/\sqrt{n})$. Then $E[Y_n] < 0$. Since
$X_1$ has mean 0 and variance 1, $P[X_1 > 0] > 0$, and it follows by (9.28)
that $P[X_1 > a_n/\sqrt{n}] > 0$ for $n$ sufficiently large, in which case $P[Y_n > 0] \ge
P^n[X_1 - a_n/\sqrt{n} > 0] > 0$. Thus Theorem 9.1 applies to $Y_n$ for all large
enough $n$.

Let $M_n(t)$, $\rho_n$, $\tau_n$, and $Z_n$ be associated with $Y_n$ as in the theorem. If $m(t)$
and $c(t)$ are the moment and cumulant generating functions of the $X_n$, then
$M_n(t)$ is the $n$th power of the moment generating function $e^{-t a_n/\sqrt{n}} m(t)$ of
$X_1 - a_n/\sqrt{n}$, and so $Y_n$ has cumulant generating function


(9.30)    $C_n(t) = -t a_n \sqrt{n} + n c(t).$


Since $\tau_n$ is the unique minimum of $C_n(t)$, and since $C_n'(t) = -a_n \sqrt{n} +
n c'(t)$, $\tau_n$ is determined by the equation $c'(\tau_n) = a_n/\sqrt{n}$. Since $X_1$ has mean
0 and variance 1, it follows by (9.6) that


(9.31)    $c(0) = c'(0) = 0, \qquad c''(0) = 1.$


Now $c'(t)$ is nondecreasing because $c(t)$ is convex, and since $c'(\tau_n) = a_n/\sqrt{n}$
goes to 0, $\tau_n$ must therefore go to 0 as well and must in fact be $O(a_n/\sqrt{n})$.
By the second-order mean-value theorem for $c'(t)$, $a_n/\sqrt{n} = c'(\tau_n) = \tau_n +
O(\tau_n^2)$, from which follows


(9.32)    $\tau_n = \dfrac{a_n}{\sqrt{n}} + O\!\left(\dfrac{a_n^2}{n}\right).$

By the third-order mean-value theorem for $c(t)$,


$\log \rho_n = C_n(\tau_n) = -\tau_n a_n \sqrt{n} + n c(\tau_n)$
$\quad = -\tau_n a_n \sqrt{n} + n\left[\frac{1}{2}\tau_n^2 + O(\tau_n^3)\right].$

Applying (9.32) gives

(9.33)    $\log \rho_n = -\frac{1}{2} a_n^2 + o(a_n^2).$


Now (see (9.14)) $Z_n$ has moment generating function $M_n(\tau_n + t)/\rho_n$ and
(see (9.30)) cumulant generating function $D_n(t) = C_n(\tau_n + t) - \log \rho_n =
-(\tau_n + t) a_n \sqrt{n} + n c(t + \tau_n) - \log \rho_n$. The mean of $Z_n$ is $D_n'(0) = 0$. Its variance $s_n^2$
is $D_n''(0)$; by (9.31), this is


(9.34)    $s_n^2 = n c''(\tau_n) = n(c''(0) + O(\tau_n)) = n(1 + o(1)).$


The fourth cumulant of $Z_n$ is $D_n''''(0) = n c''''(\tau_n) = O(n)$. By the formula (9.9)
relating moments and cumulants (applicable because $E[Z_n] = 0$), $E[Z_n^4] =
3 s_n^4 + D_n''''(0)$. Therefore, $E[Z_n^4]/s_n^4 \to 3$, and it follows by Theorem 9.2 that
there exists an $\alpha$ such that $P[Z_n \ge 0] \ge \alpha > 0$ for all sufficiently large $n$.

By Theorem 9.1, $P[Y_n \ge 0] = \rho_n e^{-\theta_n}$ with $0 \le \theta_n \le \tau_n s_n \alpha^{-1} - \log \alpha$. By
(9.28), (9.32), and (9.34), $\theta_n = O(a_n) = o(a_n^2)$, and it follows by (9.33) that
$P[Y_n \ge 0] = e^{-a_n^2 (1 + o(1))/2}$. ∎


The law of the iterated logarithm is this: 
Theorem 9.5. Let $S_n = X_1 + \cdots + X_n$, where the $X_n$ are independent,
identically distributed simple random variables with mean 0 and variance 1.
Then


(9.35)    $P\left[\limsup_n \dfrac{S_n}{\sqrt{2n \log\log n}} = 1\right] = 1.$

Equivalent to (9.35) is the assertion that for positive $\epsilon$

(9.36)    $P[S_n \ge (1 + \epsilon)\sqrt{2n \log\log n} \ \text{i.o.}] = 0$

and

(9.37)    $P[S_n \ge (1 - \epsilon)\sqrt{2n \log\log n} \ \text{i.o.}] = 1.$


The set in (9.35) is, in fact, the intersection over positive rational $\epsilon$ of the sets
in (9.37) minus the union over positive rational $\epsilon$ of the sets in (9.36).

The idea of the proof is this. Write


(9.38)    $\phi(n) = \sqrt{2n \log\log n}.$


If $A_n^{\pm} = [S_n \ge (1 \pm \epsilon)\phi(n)]$, then by (9.29), $P(A_n^{\pm})$ is near $(\log n)^{-(1 \pm \epsilon)^2}$. If
$n_k$ increases exponentially, say $n_k \sim \theta^k$ for $\theta > 1$, then $P(A_{n_k}^{\pm})$ is of the order
$k^{-(1 \pm \epsilon)^2}$. Now $\sum_k k^{-(1 \pm \epsilon)^2}$ converges if the sign is $+$ and diverges if the sign
is $-$. It will follow by the first Borel-Cantelli lemma that there is probability
0 that $A_{n_k}^{+}$ occurs for infinitely many $k$. In proving (9.36), an extra
argument is required to get around the fact that the $A_n^{+}$ for $n \ne n_k$ must also
be accounted for (this requires choosing $\theta$ near 1). If the $A_{n_k}^{-}$ were independent,
it would follow by the second Borel-Cantelli lemma that with probability
1, $A_{n_k}^{-}$ occurs for infinitely many $k$, which would in turn imply (9.37). An
extra argument is required to get around the fact that the $A_{n_k}^{-}$ are dependent
(this requires choosing $\theta$ large).
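
The orders of magnitude quoted in this outline follow from (9.29) by a short calculation. It is sketched here with the error factor $1 + \zeta_n$ suppressed, so each step is a heuristic approximation rather than an exact identity.

```latex
% Take a_n = (1 \pm \epsilon)\sqrt{2\log\log n} in (9.29), so that a_n\sqrt{n} = (1 \pm \epsilon)\phi(n).
\begin{align*}
P(A_n^{\pm}) &\approx \exp\left[-\tfrac{1}{2} a_n^2\right]
   = \exp\left[-(1 \pm \epsilon)^2 \log\log n\right]
   = (\log n)^{-(1 \pm \epsilon)^2}, \\
P(A_{n_k}^{\pm}) &\approx (k \log\theta)^{-(1 \pm \epsilon)^2}
   \quad \text{for } n_k \sim \theta^k, \text{ since } \log n_k \sim k \log\theta .
\end{align*}
% The series \sum_k k^{-c} converges exactly when c > 1, and for 0 < \epsilon < 1
% one has (1 + \epsilon)^2 > 1 > (1 - \epsilon)^2.
```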

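For the proof of (9.36) a preliminary result is needed. Put $M_k =
\max\{S_0, S_1, \ldots, S_k\}$, where $S_0 = 0$.


Theorem 9.6. If the $X_k$ are independent simple random variables with
mean 0 and variance 1, then for $\alpha \ge \sqrt{2}$,


(9.39)    $P\left[\dfrac{M_n}{\sqrt{n}} \ge \alpha\right] \le 2 P\left[\dfrac{S_n}{\sqrt{n}} \ge \alpha - \sqrt{2}\right].$


PROOF. If $A_j = [M_{j-1} < \alpha\sqrt{n} \le M_j]$, then


$P\left[\dfrac{M_n}{\sqrt{n}} \ge \alpha\right] \le P\left[\dfrac{S_n}{\sqrt{n}} \ge \alpha - \sqrt{2}\right] + \sum_{j=1}^{n-1} P\left(A_j \cap \left[\dfrac{S_n}{\sqrt{n}} \le \alpha - \sqrt{2}\right]\right).$


Since $S_n - S_j$ has variance $n - j$, it follows by independence and Chebyshev's
inequality that the probability in the sum is at most


$P\left(A_j \cap \left[\dfrac{|S_n - S_j|}{\sqrt{n}} \ge \sqrt{2}\right]\right) = P(A_j)\, P\left[\dfrac{|S_n - S_j|}{\sqrt{n}} \ge \sqrt{2}\right] \le P(A_j)\, \dfrac{n - j}{2n} \le \dfrac{1}{2} P(A_j).$

Since $\bigcup_{j=1}^{n-1} A_j \subset [M_n \ge \alpha\sqrt{n}]$,

$P\left[\dfrac{M_n}{\sqrt{n}} \ge \alpha\right] \le P\left[\dfrac{S_n}{\sqrt{n}} \ge \alpha - \sqrt{2}\right] + \dfrac{1}{2} P\left[\dfrac{M_n}{\sqrt{n}} \ge \alpha\right].$ ∎


PROOF OF (9.36). Given $\epsilon$, choose $\theta$ so that $\theta > 1$ but $\theta^2 < 1 + \epsilon$. Let
$n_k = \lfloor \theta^k \rfloor$ and $x_k = \theta (2 \log\log n_k)^{1/2}$. By (9.29) and (9.39),


$P\left[\dfrac{M_{n_k}}{\sqrt{n_k}} \ge x_k\right] \le 2 \exp\left[-\dfrac{1}{2}(x_k - \sqrt{2})^2 (1 + \xi_k)\right],$


where $\xi_k \to 0$. The negative of the exponent is asymptotically $\theta^2 \log k$ and
hence for large $k$ exceeds $\theta \log k$, so that


$P\left[\dfrac{M_{n_k}}{\sqrt{n_k}} \ge x_k\right] \le \dfrac{2}{k^{\theta}}.$


Since $\theta > 1$, it follows by the first Borel-Cantelli lemma that there is
probability 0 that (see (9.38))


(9.40)    $M_{n_k} \ge \theta \phi(n_k)$

for infinitely many $k$. Suppose that $n_{k-1} < n \le n_k$ and that

(9.41)    $S_n > (1 + \epsilon)\phi(n).$


Now $\phi(n) \ge \phi(n_{k-1}) \sim \theta^{-1/2} \phi(n_k)$; hence, by the choice of $\theta$, $(1 + \epsilon)\phi(n) >
\theta \phi(n_k)$ if $k$ is large enough. Thus for sufficiently large $k$, (9.41) implies (9.40)
(if $n_{k-1} < n \le n_k$), and there is therefore probability 0 that (9.41) holds for
infinitely many $n$. ∎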


PROOF OF (9.37). Given $\epsilon$, choose an integer $\theta$ so large that $3\theta^{-1/2} < \epsilon$.
Take $n_k = \theta^k$. Now $n_k - n_{k-1} \to \infty$, and (9.29) applies with $n = n_k - n_{k-1}$ and
$a_n = x_k/\sqrt{n_k - n_{k-1}}$, where $x_k = (1 - \theta^{-1})\phi(n_k)$. It follows that


$P[S_{n_k} - S_{n_{k-1}} \ge x_k] = P[S_{n_k - n_{k-1}} \ge x_k] = \exp\left[-\dfrac{1}{2}\, \dfrac{x_k^2}{n_k - n_{k-1}}\, (1 + \xi_k)\right],$


where $\xi_k \to 0$. The negative of the exponent is asymptotically $(1 - \theta^{-1}) \log k$
and so for large $k$ is less than $\log k$, in which case $P[S_{n_k} - S_{n_{k-1}} \ge x_k] \ge k^{-1}$.
The events here being independent, it follows by the second Borel-Cantelli
lemma that with probability 1, $S_{n_k} - S_{n_{k-1}} \ge x_k$ for infinitely many $k$. On the
other hand, by (9.36) applied to $\{-X_n\}$, there is probability 1 that $-S_{n_{k-1}} \le
2\phi(n_{k-1}) \le 2\theta^{-1/2}\phi(n_k)$ for all but finitely many $k$. These two inequalities
give $S_{n_k} \ge x_k - 2\theta^{-1/2}\phi(n_k) > (1 - \epsilon)\phi(n_k)$, the last inequality because of the
choice of $\theta$. ∎


That completes the proof of Theorem 9.5.

PROBLEMS


9.1. Prove (6.2) by using (9.9) and the fact that cumulants add in the presence of
independence.


9.2. In the Bernoulli case, (9.21) gives

$P[S_n \ge np + x_n] = \exp\left[-nK\left(p + \dfrac{x_n}{n}, p\right)(1 + o(1))\right],$

where $p < a < 1$ and $x_n = n(a - p)$. Theorem 9.4 gives

$P[S_n \ge np + x_n] = \exp\left[-\dfrac{x_n^2}{2npq}(1 + o(1))\right],$

where $x_n = a_n \sqrt{npq}$. Resolve the apparent discrepancy. Use (9.25) to compare
the two expressions in case $x_n/n$ is small. See Problem 27.17.


9.3. Relabel the binomial parameter $p$ as $\theta = f(p)$, where $f$ is increasing and
continuously differentiable. Show by (9.27) that the distinguishability of $\theta$ from
$\theta + \Delta\theta$, as measured by $K$, is $(\Delta\theta)^2/8p(1 - p)(f'(p))^2 + O((\Delta\theta)^3)$. The leading
coefficient is independent of $\theta$ if $f(p) = \arcsin\sqrt{p}$.


9.4. From (9.35) and the same result for $\{-S_n\}$, together with the uniform boundedness
of the $X_n$, deduce that with probability 1 the set of limit points of the
sequence $\{S_n (2n \log\log n)^{-1/2}\}$ is the closed interval from $-1$ to $+1$.


9.5. ↑ Suppose $X_n$ takes the values $\pm 1$ with probability $\frac{1}{2}$ each, and show that
$P[S_n = 0 \ \text{i.o.}] = 1$. (This gives still another proof of the persistence of symmetric
random walk on the line (Example 8.6).) Show more generally that, if the $X_n$ are
bounded by $M$, then $P[|S_n| \le M \ \text{i.o.}] = 1$.


9.6. Weakened versions of (9.36) are quite easy to prove. By a fourth-moment
argument (see (6.2)), show that $P[S_n > n^{3/4}(\log n)^{(1+\epsilon)/4} \ \text{i.o.}] = 0$. Use
(9.29) to give a simple proof that $P[S_n > (3n \log n)^{1/2} \ \text{i.o.}] = 0$.


9.7. Show that (9.35) is true if $S_n$ is replaced by $|S_n|$ or $\max_{k \le n} S_k$ or $\max_{k \le n} |S_k|$.


CHAPTER 2


Measure 


SECTION 10. GENERAL MEASURES 


Lebesgue measure on the unit interval was central to the ideas in Chapter 1. 
Lebesgue measure on the entire real line is important in probability as well 
as in analysis generally, and a uniform treatment of this and other examples 
requires a notion of measure for which infinite values are possible. The 
present chapter extends the ideas of Sections 2 and 3 to this more general 
setting. 


Classes of Sets 


The $\sigma$-field of Borel sets in $(0, 1]$ played an essential role in Chapter 1, and it
is necessary to construct the analogous classes for the entire real line and for
$k$-dimensional Euclidean space.


Example 10.1. Let $x = (x_1, \ldots, x_k)$ be the generic point of Euclidean
$k$-space $R^k$. The bounded rectangles


(10.1)    $[x = (x_1, \ldots, x_k)\colon a_i < x_i \le b_i,\ i = 1, \ldots, k]$


will play in $R^k$ the role intervals $(a, b]$ played in $(0, 1]$. Let $\mathcal{R}^k$ be the $\sigma$-field
generated by these rectangles. This is the analogue of the class $\mathcal{B}$
of Borel sets in $(0, 1]$; see Example 2.6. The elements of $\mathcal{R}^k$ are the
$k$-dimensional Borel sets. For $k = 1$ they are also called the linear Borel sets.

Call the rectangle (10.1) rational if the $a_i$ and $b_i$ are all rational. If $G$ is an
open set in $R^k$ and $y \in G$, then there is a rational rectangle $A_y$ such that
$y \in A_y \subset G$. But then $G = \bigcup_{y \in G} A_y$, and since there are only countably
many rational rectangles, this is a countable union. Thus $\mathcal{R}^k$ contains the
open sets. Since a closed set has open complement, $\mathcal{R}^k$ also contains the
closed sets. Just as $\mathcal{B}$ contains all the sets in $(0, 1]$ that actually arise in
ordinary analysis and probability theory, $\mathcal{R}^k$ contains all the sets in $R^k$ that
actually arise.

The $\sigma$-field $\mathcal{R}^k$ is generated by subclasses other than the class of
rectangles. If $A_n$ is the $x$-set where $a_i < x_i < b_i + n^{-1}$, $i = 1, \ldots, k$, then $A_n$ is
open and (10.1) is $\bigcap_n A_n$. Thus $\mathcal{R}^k$ is generated by the open sets. Similarly, it
is generated by the closed sets. Now an open set is a countable union of
rational rectangles. Therefore, the (countable) class of rational rectangles
generates $\mathcal{R}^k$. ∎

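The $\sigma$-field $\mathcal{R}^1$ on the line $R^1$ is by definition generated by the finite
intervals. The $\sigma$-field $\mathcal{B}$ in $(0, 1]$ is generated by the subintervals of $(0, 1]$. The
question naturally arises whether the elements of $\mathcal{B}$ are the elements of $\mathcal{R}^1$
that happen to lie inside $(0, 1]$, and the answer is yes. If $\mathcal{A}$ is a class of sets in
a space $\Omega$ and $\Omega_0$ is a subset of $\Omega$, let $\mathcal{A} \cap \Omega_0 = [A \cap \Omega_0\colon A \in \mathcal{A}]$.


Theorem 10.1. (i) If $\mathcal{F}$ is a $\sigma$-field in $\Omega$, then $\mathcal{F} \cap \Omega_0$ is a $\sigma$-field in $\Omega_0$.


(ii) If $\mathcal{A}$ generates the $\sigma$-field $\mathcal{F}$ in $\Omega$, then $\mathcal{A} \cap \Omega_0$ generates the $\sigma$-field
$\mathcal{F} \cap \Omega_0$ in $\Omega_0$: $\sigma(\mathcal{A} \cap \Omega_0) = \sigma(\mathcal{A}) \cap \Omega_0$.


PROOF. Of course $\Omega_0 = \Omega \cap \Omega_0$ lies in $\mathcal{F} \cap \Omega_0$. If $B$ lies in $\mathcal{F} \cap \Omega_0$, so
that $B = A \cap \Omega_0$ for an $A \in \mathcal{F}$, then $\Omega_0 - B = (\Omega - A) \cap \Omega_0$ lies in $\mathcal{F} \cap \Omega_0$.
If $B_n$ lies in $\mathcal{F} \cap \Omega_0$ for all $n$, so that $B_n = A_n \cap \Omega_0$ for an $A_n \in \mathcal{F}$, then
$\bigcup_n B_n = (\bigcup_n A_n) \cap \Omega_0$ lies in $\mathcal{F} \cap \Omega_0$. Hence part (i).

Let $\mathcal{F}_0$ be the $\sigma$-field $\mathcal{A} \cap \Omega_0$ generates in $\Omega_0$. Since $\mathcal{A} \cap \Omega_0 \subset \mathcal{F} \cap \Omega_0$
and $\mathcal{F} \cap \Omega_0$ is a $\sigma$-field by part (i), $\mathcal{F}_0 \subset \mathcal{F} \cap \Omega_0$.

Now $\mathcal{F} \cap \Omega_0 \subset \mathcal{F}_0$ will follow if it is shown that $A \in \mathcal{F}$ implies $A \cap \Omega_0
\in \mathcal{F}_0$, or, to put it another way, if it is shown that $\mathcal{F}$ is contained in
$\mathcal{G} = [A \subset \Omega\colon A \cap \Omega_0 \in \mathcal{F}_0]$. Since $A \in \mathcal{A}$ implies that $A \cap \Omega_0$ lies in $\mathcal{A} \cap \Omega_0$
and hence in $\mathcal{F}_0$, it follows that $\mathcal{A} \subset \mathcal{G}$. It is therefore enough to show
that $\mathcal{G}$ is a $\sigma$-field in $\Omega$. Since $\Omega \cap \Omega_0 = \Omega_0$ lies in $\mathcal{F}_0$, it follows that
$\Omega \in \mathcal{G}$. If $A \in \mathcal{G}$, then $(\Omega - A) \cap \Omega_0 = \Omega_0 - (A \cap \Omega_0)$ lies in $\mathcal{F}_0$ and hence
$\Omega - A \in \mathcal{G}$. If $A_n \in \mathcal{G}$ for all $n$, then $(\bigcup_n A_n) \cap \Omega_0 = \bigcup_n (A_n \cap \Omega_0)$ lies in
$\mathcal{F}_0$ and hence $\bigcup_n A_n \in \mathcal{G}$. ∎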
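
On a finite space the identity in part (ii) can be verified by brute force. The Python sketch below is illustrative only: the space, the generating class, the subset $\Omega_0$, and the function names are hypothetical choices for the example. It uses the fact that on a finite space the generated $\sigma$-field consists of all unions of atoms, where two points belong to the same atom when they lie in exactly the same generating sets.

```python
from itertools import combinations

def generated_sigma_field(omega, generators):
    """All unions of atoms; an atom groups the points with one membership pattern."""
    generators = list(generators)
    atoms = {}
    for w in omega:
        pattern = tuple(w in g for g in generators)
        atoms.setdefault(pattern, set()).add(w)
    atom_list = [frozenset(a) for a in atoms.values()]
    field = set()
    for r in range(len(atom_list) + 1):
        for combo in combinations(atom_list, r):
            field.add(frozenset().union(*combo))
    return field

def trace(cls, omega0):
    """The class of sets A ∩ Ω0 with A ranging over cls."""
    return {frozenset(a) & omega0 for a in cls}

# Hypothetical finite example.
omega = frozenset(range(1, 7))
gens = [frozenset({1, 2}), frozenset({2, 3, 4})]
omega0 = frozenset({2, 3, 5})

lhs = generated_sigma_field(omega0, trace(gens, omega0))   # sigma(A ∩ Ω0), formed in Ω0
rhs = trace(generated_sigma_field(omega, gens), omega0)    # sigma(A) ∩ Ω0

print(lhs == rhs, len(lhs))
```

Here the generated $\sigma$-field in $\Omega$ has four atoms and sixteen sets, and both sides of the identity reduce to the eight subsets of $\Omega_0$.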


If $\Omega_0 \in \mathcal{F}$, then $\mathcal{F} \cap \Omega_0 = [A\colon A \subset \Omega_0,\ A \in \mathcal{F}]$. If $\Omega = R^1$, $\Omega_0 = (0, 1]$,
and $\mathcal{F} = \mathcal{R}^1$, and if $\mathcal{A}$ is the class of finite intervals on the line, then $\mathcal{A} \cap \Omega_0$
is the class of subintervals of $(0, 1]$, and $\mathcal{B} = \sigma(\mathcal{A} \cap \Omega_0)$ is given by


(10.2)    $\mathcal{B} = [A\colon A \subset (0, 1],\ A \in \mathcal{R}^1].$


A subset of $(0, 1]$ is thus a Borel set (lies in $\mathcal{B}$) if and only if it is a linear
Borel set (lies in $\mathcal{R}^1$), and the distinction in terminology can be dropped.


Conventions Involving $\infty$


Measures assume values in the set $[0, \infty]$ consisting of the ordinary nonnegative reals
and the special value $\infty$, and some arithmetic conventions are called for.

For $x, y \in [0, \infty]$, $x \le y$ means that $y = \infty$ or else $x$ and $y$ are finite (that is, are
ordinary real numbers) and $x \le y$ holds in the usual sense. Similarly, $x < y$ means that
$y = \infty$ and $x$ is finite or else $x$ and $y$ are both finite and $x < y$ holds in the usual
sense.

For a finite or infinite sequence $x, x_1, x_2, \ldots$ in $[0, \infty]$,


(10.3)    $x = \sum_k x_k$


means that either (i) $x = \infty$ and $x_k = \infty$ for some $k$, or (ii) $x = \infty$ and $x_k < \infty$ for all $k$
and $\sum_k x_k$ is an ordinary divergent infinite series, or (iii) $x < \infty$ and $x_k < \infty$ for all $k$
and (10.3) holds in the usual sense for $\sum_k x_k$ an ordinary finite sum or convergent
infinite series. By these conventions and Dirichlet's theorem [A26], the order of
summation in (10.3) has no effect on the sum.
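
For finitely many terms, the conventions around (10.3) match ordinary floating-point arithmetic with an infinite value. The Python sketch below is illustrative only: the function name and the sample terms are assumptions for the example, and a program can exhibit only finite sums and partial sums, not the countable sums themselves.

```python
import math
from itertools import permutations

def extended_sum(terms):
    """Sum of finitely many terms of [0, inf] under the convention of (10.3)."""
    total = 0
    for x in terms:
        if x < 0:
            raise ValueError("terms must lie in [0, inf]")
        total += x          # inf + x = inf, so one infinite term makes the sum infinite
    return total

finite_terms = [1, 2, 3]
mixed_terms = [1, math.inf, 3]

# The order of summation has no effect on the sum.
finite_sums = {extended_sum(perm) for perm in permutations(finite_terms)}
mixed_sums = {extended_sum(perm) for perm in permutations(mixed_terms)}

# Partial sums of a divergent series keep increasing, as in case (ii) of (10.3).
partial = [sum(1 / k for k in range(1, n + 1)) for n in (10, 100, 1000)]

print(finite_sums, mixed_sums, partial)
```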

For an infinite sequence $x, x_1, x_2, \ldots$ in $[0, \infty]$,


(10.4)    $x_k \uparrow x$


means in the first place that $x_k \le x_{k+1} \le x$ and in the second place that either (i)
$x < \infty$ and there is convergence in the usual sense, or (ii) $x_k = \infty$ for some $k$, or (iii)
$x = \infty$ and the $x_k$ are finite reals converging to infinity in the usual sense.


Measures 


A set function $\mu$ on a field $\mathcal{F}$ in $\Omega$ is a measure if it satisfies these
conditions:


(i) $\mu(A) \in [0, \infty]$ for $A \in \mathcal{F}$;
(ii) $\mu(\emptyset) = 0$;
(iii) if $A_1, A_2, \ldots$ is a disjoint sequence of $\mathcal{F}$-sets and if $\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$,
then (see (10.3))


$\mu\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} \mu(A_k).$


The measure $\mu$ is finite or infinite as $\mu(\Omega) < \infty$ or $\mu(\Omega) = \infty$; it is a
probability measure if $\mu(\Omega) = 1$, as in Chapter 1.

If $\Omega = A_1 \cup A_2 \cup \cdots$ for some finite or countable sequence of $\mathcal{F}$-sets
satisfying $\mu(A_k) < \infty$, then $\mu$ is $\sigma$-finite. The significance of this concept will
be seen later. A finite measure is by definition $\sigma$-finite; a $\sigma$-finite measure
may be finite or infinite. If $\mathcal{A}$ is a subclass of $\mathcal{F}$, then $\mu$ is $\sigma$-finite on $\mathcal{A}$ if
$\Omega = \bigcup_k A_k$ for some finite or infinite sequence of $\mathcal{A}$-sets satisfying $\mu(A_k) <
\infty$. It is not required that these sets $A_k$ be disjoint. Note that if $\Omega$ is not a
finite or countable union of $\mathcal{A}$-sets, then no measure can be $\sigma$-finite on $\mathcal{A}$. It
is important to understand that $\sigma$-finiteness is a joint property of the space
$\Omega$, the measure $\mu$, and the class $\mathcal{A}$.

If $\mu$ is a measure on a $\sigma$-field $\mathcal{F}$ in $\Omega$, the triple $(\Omega, \mathcal{F}, \mu)$ is a measure
space. (This term is not used if $\mathcal{F}$ is merely a field.) It is an infinite, a
$\sigma$-finite, a finite, or a probability measure space according as $\mu$ has the
corresponding property. If $\mu(A^c) = 0$ for an $\mathcal{F}$-set $A$, then $A$ is a support of
$\mu$, and $\mu$ is concentrated on $A$. For a finite measure, $A$ is a support if and
only if $\mu(A) = \mu(\Omega)$.

The pair $(\Omega, \mathcal{F})$ itself is a measurable space if $\mathcal{F}$ is a $\sigma$-field in $\Omega$. To say
that $\mu$ is a measure on $(\Omega, \mathcal{F})$ indicates clearly both the space and the class
of sets involved.

As in the case of probability measures, (iii) above is the condition of
countable additivity, and it implies finite additivity: If $A_1, \ldots, A_n$ are disjoint
$\mathcal{F}$-sets, then


$\mu\left(\bigcup_{k=1}^{n} A_k\right) = \sum_{k=1}^{n} \mu(A_k).$


As in the case of probability measures, if this holds for $n = 2$, then it extends
inductively to all $n$.


Example 10.2. A measure $\mu$ on $(\Omega, \mathcal{F})$ is discrete if there are finitely or
countably many points $\omega_i$ in $\Omega$ and masses $m_i$ in $[0, \infty]$ such that $\mu(A) =
\sum_{\omega_i \in A} m_i$ for $A \in \mathcal{F}$. It is an infinite, a finite, or a probability measure as
$\sum_i m_i$ diverges, or converges, or converges to 1; the last case was treated in
Example 2.9. If $\mathcal{F}$ contains each singleton $\{\omega_i\}$, then $\mu$ is $\sigma$-finite if and only
if $m_i < \infty$ for all $i$. ∎


Example 10.3. Let $\mathcal{F}$ be the $\sigma$-field of all subsets of an arbitrary $\Omega$, and
let $\mu(A)$ be the number of points in $A$, where $\mu(A) = \infty$ if $A$ is not finite.
This $\mu$ is counting measure; it is finite if and only if $\Omega$ is finite, and is $\sigma$-finite
if and only if $\Omega$ is countable. Even if $\mathcal{F}$ does not contain every subset of $\Omega$,
counting measure is well defined on $\mathcal{F}$. ∎
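
Discrete measures on a finite set of points are simple to model. The Python sketch below is illustrative only: the masses, the sets, and the helper name `discrete_measure` are hypothetical choices for the example. It represents a discrete measure by its masses, treats counting measure on a finite space as the special case of unit masses, and shows that an infinite mass is permitted.

```python
import math
from fractions import Fraction

def discrete_measure(masses):
    """Return mu with mu(A) equal to the sum of the masses m_i of the points w_i in A."""
    def mu(A):
        total = 0
        for w, m in masses.items():
            if w in A:
                total += m
        return total
    return mu

# Masses 2^{-i} on the points 1, ..., 10 (a finite measure; exact rational arithmetic).
mu = discrete_measure({i: Fraction(1, 2 ** i) for i in range(1, 11)})

# Counting measure on a finite space is the discrete measure with unit masses.
omega = set(range(1, 11))
count = discrete_measure({w: 1 for w in omega})

A, B = {1, 2, 3}, {7, 8}              # two disjoint sets

# An infinite mass is allowed; any set containing that point has infinite measure.
nu = discrete_measure({"a": 1, "b": math.inf})

print(mu(A | B), mu(A) + mu(B), count(A), nu({"a", "b"}))
```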


Example 10.4. Specifying a measure includes specifying its domain. If $\mu$
is a measure on a field $\mathcal{F}$ and $\mathcal{F}_0$ is a field contained in $\mathcal{F}$, then the
restriction $\mu_0$ of $\mu$ to $\mathcal{F}_0$ is also a measure. Although often denoted by the
same symbol, $\mu_0$ is really a different measure from $\mu$ unless $\mathcal{F}_0 = \mathcal{F}$. Its
properties may be different: If $\mu$ is counting measure on the $\sigma$-field $\mathcal{F}$ of all
subsets of a countably infinite $\Omega$, then $\mu$ is $\sigma$-finite, but its restriction to the
$\sigma$-field $\mathcal{F}_0 = \{\emptyset, \Omega\}$ is not $\sigma$-finite. ∎


Certain properties of probability measures carry over immediately to the
general case. First, $\mu$ is monotone: $\mu(A) \le \mu(B)$ if $A \subset B$. This is derived,
just like its special case (2.5), from $\mu(A) + \mu(B - A) = \mu(B)$. But it is
possible to go on and write $\mu(B - A) = \mu(B) - \mu(A)$ only if $\mu(B) < \infty$. If
$\mu(B) = \infty$ and $\mu(A) < \infty$, then $\mu(B - A) = \infty$; but for every $\alpha \in [0, \infty]$ there
are cases where $\mu(A) = \mu(B) = \infty$ and $\mu(B - A) = \alpha$. The inclusion-exclusion
formula (2.9) also carries over without change to $\mathcal{F}$-sets of finite measure:


(10.5)    $\mu\left(\bigcup_{k=1}^{n} A_k\right) = \sum_i \mu(A_i) - \sum_{i<j} \mu(A_i \cap A_j) + \cdots + (-1)^{n+1} \mu(A_1 \cap \cdots \cap A_n).$


The proof of finite subadditivity also goes through just as before:

$\mu\left(\bigcup_{k=1}^{n} A_k\right) \le \sum_{k=1}^{n} \mu(A_k);$


here the $A_k$ need not have finite measure.
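
Both (10.5) and finite subadditivity can be confirmed on a small discrete measure. The Python sketch below is illustrative only: the masses and the three sets are hypothetical, and exact rational arithmetic is used so that the two sides of (10.5) can be compared for equality.

```python
from fractions import Fraction
from itertools import combinations

def mu(A, masses):
    """Discrete measure of A: the sum of the masses of its points."""
    return sum((masses[w] for w in A), Fraction(0))

def inclusion_exclusion(sets, masses):
    """Right side of (10.5): alternating sum over all nonempty subfamilies."""
    total = Fraction(0)
    for r in range(1, len(sets) + 1):
        sign = 1 if r % 2 == 1 else -1           # (-1)^{r+1}
        for combo in combinations(sets, r):
            total += sign * mu(set.intersection(*combo), masses)
    return total

masses = {w: Fraction(1, w) for w in range(1, 9)}   # hypothetical masses
sets = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}]
union = set().union(*sets)

lhs = mu(union, masses)                     # measure of the union
rhs = inclusion_exclusion(sets, masses)     # inclusion-exclusion sum
bound = sum(mu(A, masses) for A in sets)    # subadditivity bound

print(lhs, rhs, bound)
```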
Theorem 10.2. Let $\mu$ be a measure on a field $\mathcal{F}$.


(i) Continuity from below: If $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \uparrow A$, then†
$\mu(A_n) \uparrow \mu(A)$.

(ii) Continuity from above: If $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \downarrow A$, and if
$\mu(A_1) < \infty$, then $\mu(A_n) \downarrow \mu(A)$.

(iii) Countable subadditivity: If $A_1, A_2, \ldots$ and $\bigcup_{k=1}^{\infty} A_k$ lie in $\mathcal{F}$, then


$\mu\left(\bigcup_{k=1}^{\infty} A_k\right) \le \sum_{k=1}^{\infty} \mu(A_k).$

(iv) If $\mu$ is $\sigma$-finite on $\mathcal{F}$, then $\mathcal{F}$ cannot contain an uncountable, disjoint
collection of sets of positive $\mu$-measure.


†See (10.4).


PROOF. The proofs of (i) and (iii) are exactly as for the corresponding
parts of Theorem 2.1. The same is essentially true of (ii): If $\mu(A_1) < \infty$,
subtraction is possible and $A_1 - A_n \uparrow A_1 - A$ implies that $\mu(A_1) - \mu(A_n) =
\mu(A_1 - A_n) \uparrow \mu(A_1 - A) = \mu(A_1) - \mu(A)$.

There remains (iv). Let $[B_\theta\colon \theta \in \Theta]$ be a disjoint collection of $\mathcal{F}$-sets
satisfying $\mu(B_\theta) > 0$. Consider an $\mathcal{F}$-set $A$ for which $\mu(A) < \infty$. If $\theta_1, \ldots, \theta_n$
are distinct indices satisfying $\mu(A \cap B_{\theta_i}) \ge \epsilon > 0$, then $n\epsilon \le \sum_{i=1}^{n} \mu(A \cap B_{\theta_i})
\le \mu(A)$, and so $n \le \mu(A)/\epsilon$. Thus the index set $[\theta\colon \mu(A \cap B_\theta) > \epsilon]$ is finite,
and hence (take the union over positive rational $\epsilon$) $[\theta\colon \mu(A \cap B_\theta) > 0]$ is
countable. Since $\mu$ is $\sigma$-finite, $\Omega = \bigcup_k A_k$ for some finite or countable
sequence of $\mathcal{F}$-sets $A_k$ satisfying $\mu(A_k) < \infty$. But then $\Theta_k = [\theta\colon \mu(A_k \cap B_\theta)
> 0]$ is countable for each $k$. Since $\mu(B_\theta) > 0$, there is a $k$ for which
$\mu(A_k \cap B_\theta) > 0$, and so $\Theta = \bigcup_k \Theta_k$: $\Theta$ is indeed countable. ∎


Uniqueness 


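According to Theorem 3.3, probability measures agreeing on a $\pi$-system $\mathcal{P}$
agree on $\sigma(\mathcal{P})$. There is an extension to the general case.


Theorem 10.3. Suppose that $\mu_1$ and $\mu_2$ are measures on $\sigma(\mathcal{P})$, where $\mathcal{P}$
is a $\pi$-system, and suppose they are $\sigma$-finite on $\mathcal{P}$. If $\mu_1$ and $\mu_2$ agree on $\mathcal{P}$,
then they agree on $\sigma(\mathcal{P})$.


PROOF. Suppose that $B \in \mathcal{P}$ and $\mu_1(B) = \mu_2(B) < \infty$, and let $\mathcal{L}_B$ be
the class of sets $A$ in $\sigma(\mathcal{P})$ for which $\mu_1(B \cap A) = \mu_2(B \cap A)$. Then $\mathcal{L}_B$ is a
$\lambda$-system containing $\mathcal{P}$ and hence (Theorem 3.2) containing $\sigma(\mathcal{P})$.

By $\sigma$-finiteness there exist $\mathcal{P}$-sets $B_k$ satisfying $\Omega = \bigcup_k B_k$ and $\mu_1(B_k) =
\mu_2(B_k) < \infty$. By the inclusion-exclusion formula (10.5),


$\mu_\alpha\left(\bigcup_{i=1}^{n} (B_i \cap A)\right) = \sum_{1 \le i \le n} \mu_\alpha(B_i \cap A) - \sum_{1 \le i < j \le n} \mu_\alpha(B_i \cap B_j \cap A) + \cdots$


for $\alpha = 1, 2$ and all $n$. Since $\mathcal{P}$ is a $\pi$-system containing the $B_i$, it contains
the $B_i \cap B_j$, and so on. For each $\sigma(\mathcal{P})$-set $A$, the terms on the right above
are therefore the same for $\alpha = 1$ as for $\alpha = 2$. The left side is then the same
for $\alpha = 1$ as for $\alpha = 2$; letting $n \to \infty$ gives $\mu_1(A) = \mu_2(A)$. ∎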


Theorem 10.4. Suppose $\mu_1$ and $\mu_2$ are finite measures on $\sigma(\mathcal{P})$, where $\mathcal{P}$
is a $\pi$-system and $\Omega$ is a finite or countable union of sets in $\mathcal{P}$. If $\mu_1$ and $\mu_2$
agree on $\mathcal{P}$, then they agree on $\sigma(\mathcal{P})$.


PROOF. By hypothesis, $\Omega = \bigcup_k B_k$ for $\mathcal{P}$-sets $B_k$, and of course
$\mu_\alpha(B_k) \le \mu_\alpha(\Omega) < \infty$, $\alpha = 1, 2$. Thus $\mu_1$ and $\mu_2$ are $\sigma$-finite on $\mathcal{P}$, and
Theorem 10.3 applies. ∎


Example 10.5. If $\mathcal{P}$ consists of the empty set alone, then it is a $\pi$-system
and $\sigma(\mathcal{P}) = \{\emptyset, \Omega\}$. Any two finite measures agree on $\mathcal{P}$, but of course they
need not agree on $\sigma(\mathcal{P})$. Theorem 10.4 does not apply in this case, because
$\Omega$ is not a countable union of sets in $\mathcal{P}$. For the same reason, no measure on
$\sigma(\mathcal{P})$ is $\sigma$-finite on $\mathcal{P}$, and hence Theorem 10.3 does not apply. ∎


Example 10.6. Suppose that $(\Omega, \mathcal{F}) = (R^1, \mathcal{R}^1)$ and $\mathcal{P}$ consists of the
half-infinite intervals $(-\infty, x]$. By Theorem 10.4, two finite measures on $\mathcal{F}$
that agree on $\mathcal{P}$ also agree on $\mathcal{F}$. The $\mathcal{P}$-sets of finite measure required in
the definition of $\sigma$-finiteness cannot in this example be made disjoint. ∎


Example 10.7. If a measure on $(\Omega, \mathcal{F})$ is $\sigma$-finite on a subfield $\mathcal{F}_0$ of $\mathcal{F}$,
then $\Omega = \bigcup_k B_k$ for disjoint $\mathcal{F}_0$-sets $B_k$ of finite measure: if they are not
disjoint, replace $B_k$ by $B_k \cap B_1^c \cap \cdots \cap B_{k-1}^c$. ∎
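
The replacement used in Example 10.7 is a standard disjointification. The Python sketch below is illustrative only: the overlapping cover and the function name are hypothetical. Each new piece is the original set with all earlier sets removed, so in a field the pieces remain in the field, and by monotonicity they keep finite measure.

```python
def disjointify(sets):
    """Replace B_k by B_k minus (B_1 ∪ ... ∪ B_{k-1}); the union is unchanged."""
    seen = set()
    result = []
    for B in sets:
        result.append(set(B) - seen)
        seen |= set(B)
    return result

cover = [{1, 2, 3}, {2, 3, 4}, {1, 5}, {4, 5}]   # hypothetical overlapping cover
pieces = disjointify(cover)

print(pieces)
```

The step relies on complements, which is why it is available in a field but not in the class of half-infinite intervals of Example 10.6.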


The proof of Theorem 10.3 simplifies slightly if $\Omega = \bigcup_k B_k$ for disjoint
$\mathcal{P}$-sets with $\mu_1(B_k) = \mu_2(B_k) < \infty$, because additivity itself can be used in
place of the inclusion-exclusion formula.


PROBLEMS 


10.1. Show that if conditions (i) and (iii) in the definition of measure hold, and if
$\mu(A) < \infty$ for some $A \in \mathcal{F}$, then condition (ii) holds.


10.2. On the $\sigma$-field of all subsets of $\Omega = \{1, 2, \ldots\}$ put $\mu(A) = \sum_{k \in A} 2^{-k}$ if $A$ is
finite and $\mu(A) = \infty$ otherwise. Is $\mu$ finitely additive? Countably additive?


10.3. (a) In connection with Theorem 10.2(ii), show that if $A_n \downarrow A$ and $\mu(A_k) < \infty$
for some $k$, then $\mu(A_n) \downarrow \mu(A)$.
(b) Find an example in which $A_n \downarrow A$, $\mu(A_n) \equiv \infty$, and $A = \emptyset$.


10.4. The natural generalization of (4.9) is

(10.6)    $\mu(\liminf_n A_n) \le \liminf_n \mu(A_n) \le \limsup_n \mu(A_n) \le \mu(\limsup_n A_n).$


Show that the left-hand inequality always holds. Show that the right-hand
inequality holds if $\mu(\bigcup_{k \ge n} A_k) < \infty$ for some $n$ but can fail otherwise.


10.5. 3.10↑ A measure space $(\Omega, \mathcal{F}, \mu)$ is complete if $A \subset B$, $B \in \mathcal{F}$, and $\mu(B) = 0$
together imply that $A \in \mathcal{F}$; the definition is just as in the probability case. Use
the ideas of Problem 3.10 to construct a complete measure space $(\Omega, \mathcal{F}^+, \mu^+)$
such that $\mathcal{F} \subset \mathcal{F}^+$ and $\mu$ and $\mu^+$ agree on $\mathcal{F}$.


10.6. The condition in Theorem 10.2(iv) essentially characterizes $\sigma$-finiteness.
(a) Suppose that $(\Omega, \mathcal{F}, \mu)$ has no "infinite atoms," in the sense that for every
$A$ in $\mathcal{F}$, if $\mu(A) = \infty$, then there is in $\mathcal{F}$ a $B$ such that $B \subset A$ and $0 < \mu(B) < \infty$.
Show that if $\mathcal{F}$ does not contain an uncountable, disjoint collection of sets of
positive measure, then $\mu$ is $\sigma$-finite. (Use Zorn's lemma.)
(b) Show by example that this is false without the condition that there are no
"infinite atoms."


10.7. Example 10.5 shows that Theorem 10.3 fails without the $\sigma$-finiteness condition.
Construct other examples of this kind.


SECTION 11. OUTER MEASURE


Outer Measure


An outer measure is a set function $\mu^*$ that is defined for all subsets of a space
$\Omega$ and has these four properties:


(i) $\mu^*(A) \in [0, \infty]$ for every $A \subset \Omega$;
(ii) $\mu^*(\emptyset) = 0$;
(iii) $\mu^*$ is monotone: $A \subset B$ implies $\mu^*(A) \le \mu^*(B)$;
(iv) $\mu^*$ is countably subadditive: $\mu^*(\bigcup_n A_n) \le \sum_n \mu^*(A_n)$.


The set function $P^*$ defined by (3.1) is an example, one which generalizes:


Example 11.1. Let $\rho$ be a set function on a class $\mathcal{A}$ in $\Omega$. Assume that
$\emptyset \in \mathcal{A}$ and $\rho(\emptyset) = 0$, and that $\rho(A) \in [0, \infty]$ for $A \in \mathcal{A}$; $\rho$ and $\mathcal{A}$ are
otherwise arbitrary. Put


(11.1)    $\mu^*(A) = \inf \sum_n \rho(A_n),$


where the infimum extends over all finite and countable coverings of $A$ by
$\mathcal{A}$-sets $A_n$. If no such covering exists, take $\mu^*(A) = \infty$ in accordance with the
convention that the infimum over an empty set is $\infty$.

That $\mu^*$ satisfies (i), (ii), and (iii) is clear. If $\mu^*(A_n) = \infty$ for some $n$, then
obviously $\mu^*(\bigcup_n A_n) \le \sum_n \mu^*(A_n)$. Otherwise, cover each $A_n$ by $\mathcal{A}$-sets $B_{nk}$
satisfying $\sum_k \rho(B_{nk}) < \mu^*(A_n) + \epsilon/2^n$; then $\mu^*(\bigcup_n A_n) \le \sum_{n,k} \rho(B_{nk}) <
\sum_n \mu^*(A_n) + \epsilon$. Thus $\mu^*$ is an outer measure. ∎


Define $A$ to be $\mu^*$-measurable if

(11.2)    $\mu^*(A \cap E) + \mu^*(A^c \cap E) = \mu^*(E)$


for every $E$. This is the general version of the definition (3.4) used in Section
3. By subadditivity it is equivalent to


(11.3)    $\mu^*(A \cap E) + \mu^*(A^c \cap E) \le \mu^*(E).$


Denote by $\mathcal{M}(\mu^*)$ the class of $\mu^*$-measurable sets.
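
On a finite space both the outer measure (11.1) and the class of measurable sets can be computed by exhaustive search. The Python sketch below is illustrative only: the three-point space, the set function `rho`, and the function names are hypothetical. Because the class on which `rho` is defined is finite and `rho` is nonnegative, covers by distinct sets suffice, so the infimum is a minimum over finitely many covers.

```python
import math
from itertools import combinations

def subsets(items):
    """All subsets of a finite collection, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def outer_measure(A, rho):
    """Brute-force (11.1): the cheapest cover of A by sets on which rho is defined."""
    best = math.inf                        # value when no cover exists
    members = list(rho)
    for r in range(len(members) + 1):
        for cover in combinations(members, r):
            if A <= frozenset().union(*cover):
                best = min(best, sum(rho[s] for s in cover))
    return best

def is_measurable(A, omega, rho):
    """Check the defining identity (11.2) against every test set E."""
    return all(
        outer_measure(A & E, rho) + outer_measure(E - A, rho) == outer_measure(E, rho)
        for E in subsets(omega)
    )

omega = frozenset({0, 1, 2})
rho = {frozenset(): 0, frozenset({0}): 1, frozenset({1, 2}): 2}   # hypothetical rho
measurable = {A for A in subsets(omega) if is_measurable(A, omega, rho)}

print(sorted(map(sorted, measurable)))
```

In this example the singleton $\{1\}$ has outer measure 2 but is not measurable: testing with $E = \{1, 2\}$ gives $2 + 2$ on the left of (11.2) and $2$ on the right. The measurable sets are $\emptyset$, $\{0\}$, $\{1, 2\}$, and $\Omega$.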

The extension property for probability measures in Theorem 3.1 was
proved by a sequence of lemmas the first three of which carry over directly to
the case of the general outer measure: If $P^*$ is replaced by $\mu^*$ and $\mathcal{M}$ by
$\mathcal{M}(\mu^*)$ at each occurrence, the proofs hold word for word, symbol for symbol.
In particular, an examination of the arguments shows that $\infty$ as a possible
value for $\mu^*$ does not require any changes. Lemma 3 in Section 3 becomes
this:


Theorem 11.1. If $\mu^*$ is an outer measure, then $\mathcal{M}(\mu^*)$ is a $\sigma$-field, and $\mu^*$
restricted to $\mathcal{M}(\mu^*)$ is a measure.


This will be used to prove an extension theorem, but it has other 
applications as well. 


Extension 


Theorem 11.2. A measure on a field has an extension to the generated
$\sigma$-field.


If the original measure on the field is $\sigma$-finite, then it follows by Theorem
10.3 that the extension is unique.

Theorem 11.2 can be deduced from Theorem 11.1 by the arguments used
in the proof of Theorem 3.1.* It is unnecessary to retrace the steps, however,
because the ideas will appear in stronger form in the proof of the next result,
which generalizes Theorem 11.2.

Define a class $\mathcal{A}$ of subsets of $\Omega$ to be a semiring if


(i) $\emptyset \in \mathcal{A}$;
(ii) $A, B \in \mathcal{A}$ implies $A \cap B \in \mathcal{A}$;
(iii) if $A, B \in \mathcal{A}$ and $A \subset B$, then there exist disjoint $\mathcal{A}$-sets $C_1, \ldots, C_n$
such that $B - A = \bigcup_{k=1}^{n} C_k$.


The class of finite intervals in $\Omega = R^1$ and the class of subintervals of
$\Omega = (0, 1]$ are the simplest examples of semirings. Note that a semiring need
not contain $\Omega$.
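
Condition (iii) is easy to see concretely for intervals. The Python sketch below is illustrative only: it assumes intervals of the half-open form $(a, b]$, represented as pairs of endpoints, and the function names are hypothetical. Removing one interval from a larger one leaves at most two disjoint intervals of the same form.

```python
from fractions import Fraction

def difference(outer, inner):
    """For (a, b] contained in (c, d], return disjoint intervals with union (c, d] - (a, b]."""
    (c, d), (a, b) = outer, inner
    if not (c <= a <= b <= d):
        raise ValueError("inner interval must be contained in outer interval")
    pieces = [(c, a), (b, d)]                            # (c, a] and (b, d]
    return [(lo, hi) for lo, hi in pieces if lo < hi]    # drop empty pieces

def length(interval):
    lo, hi = interval
    return hi - lo

outer = (Fraction(0), Fraction(1))
inner = (Fraction(1, 4), Fraction(1, 2))
pieces = difference(outer, inner)

print(pieces, length(outer), length(inner) + sum(length(p) for p in pieces))
```

The final comparison shows length adding up over the decomposition, which is the finite additivity needed when length is extended from intervals.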


Theorem 11.3. Suppose that $\mu$ is a set function on a semiring $\mathcal{A}$. Suppose
that $\mu$ has values in $[0, \infty]$, that $\mu(\emptyset) = 0$, and that $\mu$ is finitely additive and
countably subadditive. Then $\mu$ extends to a measure on $\sigma(\mathcal{A})$.


This contains Theorem 11.2, because the conditions are all satisfied if $\mathcal{A}$
is a field and $\mu$ is a measure on it. If $\Omega = \bigcup_k A_k$ for a sequence of $\mathcal{A}$-sets
satisfying $\mu(A_k) < \infty$, then it follows by Theorem 10.3 that the extension is
unique.


*See also Problem 11.1.




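PROOF. If $A$, $B$, and the $C_k$ are related as in condition (iii) in the
definition of semiring, then by finite additivity $\mu(B) = \mu(A) + \sum_{k=1}^{n} \mu(C_k) \ge
\mu(A)$. Thus $\mu$ is monotone.

Define an outer measure $\mu^*$ by (11.1) for $\rho = \mu$:


(11.4)    $\mu^*(A) = \inf \sum_n \mu(A_n),$


the infimum extending over coverings of $A$ by $\mathcal{A}$-sets.

The first step is to show that $\mathcal{A} \subset \mathcal{M}(\mu^*)$. Suppose that $A \in \mathcal{A}$. If
$\mu^*(E) = \infty$, then (11.3) holds trivially. If $\mu^*(E) < \infty$, for given $\epsilon$ choose $\mathcal{A}$-sets
$A_n$ such that $E \subset \bigcup_n A_n$ and $\sum_n \mu(A_n) < \mu^*(E) + \epsilon$. Since $\mathcal{A}$ is a semiring,
$B_n = A \cap A_n$ lies in $\mathcal{A}$ and $A^c \cap A_n = A_n - B_n$ has the form $\bigcup_{k=1}^{m_n} C_{nk}$ for
disjoint $\mathcal{A}$-sets $C_{nk}$. Note that $A_n = B_n \cup \bigcup_{k=1}^{m_n} C_{nk}$, where the union is
disjoint, and that $A \cap E \subset \bigcup_n B_n$ and $A^c \cap E \subset \bigcup_n \bigcup_{k=1}^{m_n} C_{nk}$. By the definition
of $\mu^*$ and the assumed finite additivity of $\mu$,


$\mu^*(A \cap E) + \mu^*(A^c \cap E) \le \sum_n \mu(B_n) + \sum_n \sum_{k=1}^{m_n} \mu(C_{nk})$
$\quad = \sum_n \mu(A_n) < \mu^*(E) + \epsilon.$


Since $\epsilon$ is arbitrary, (11.3) follows. Thus $\mathcal{A} \subset \mathcal{M}(\mu^*)$.

The next step is to show that $\mu^*$ and $\mu$ agree on $\mathcal{A}$. If $A \subset \bigcup_n A_n$ for
$\mathcal{A}$-sets $A$ and $A_n$, then by the assumed countable subadditivity of $\mu$ and the
monotonicity established above, $\mu(A) \le \sum_n \mu(A \cap A_n) \le \sum_n \mu(A_n)$. Therefore,
$A \in \mathcal{A}$ implies that $\mu(A) \le \mu^*(A)$ and hence, since the reverse inequality
is an immediate consequence of (11.4), $\mu(A) = \mu^*(A)$. Thus $\mu^*$
agrees with $\mu$ on $\mathcal{A}$.

Since $\mathcal{A} \subset \mathcal{M}(\mu^*)$ and $\mathcal{M}(\mu^*)$ is a $\sigma$-field (Theorem 11.1),


$\mathcal{A} \subset \sigma(\mathcal{A}) \subset \mathcal{M}(\mu^*) \subset 2^{\Omega}.$


Since $\mu^*$ is countably additive when restricted to $\mathcal{M}(\mu^*)$ (Theorem 11.1
again), $\mu^*$ further restricted to $\sigma(\mathcal{A})$ is an extension of $\mu$ on $\mathcal{A}$, as required. ∎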


Example 11.2. For 𝒜 take the semiring of subintervals of Ω = (0, 1]
(together with the empty set). For μ take length λ: λ(a, b] = b − a. The finite
additivity and countable subadditivity of λ follow by Theorem 1.3.† By
Theorem 11.3, λ extends to a measure on the class σ(𝒜) = ℬ of Borel sets
in (0, 1]. ∎


†On a field, countable additivity implies countable subadditivity, and λ is in fact countably
additive on 𝒜, but 𝒜 is merely a semiring. Hence the separate consideration of additivity and
subadditivity; but see Problem 11.2.
This gives a second construction of Lebesgue measure in the unit interval.
In the first construction λ was extended first from the class of intervals to the
field ℬ_0 of finite disjoint unions of intervals (see Theorem 2.2) and then by
Theorem 11.2 (in its special form Theorem 3.1) from ℬ_0 to ℬ = σ(ℬ_0).
Using Theorem 11.3 instead of Theorem 11.2 effects a slight economy, since
the extension then goes from 𝒜 directly to ℬ without the intermediate stop
at ℬ_0, and the arguments involving (2.13) and (2.14) become unnecessary.


Example 11.3. In Theorem 11.3 take for 𝒜 the semiring of finite intervals
on the real line R^1, and consider λ_1(a, b] = b − a. The arguments for
Theorem 1.3 in no way require that the (finite) intervals in question be
contained in (0, 1], and so λ_1 is finitely additive and countably subadditive on
this class 𝒜. Hence λ_1 extends to the σ-field ℛ^1 of linear Borel sets, which
is by definition generated by 𝒜. This defines Lebesgue measure λ_1 over the
whole real line. ∎


A subset of (0, 1] lies in ℬ if and only if it lies in ℛ^1 (see (10.2)). Now
λ_1(A) = λ(A) for subintervals A of (0, 1], and it follows by uniqueness
(Theorem 3.3) that λ_1(A) = λ(A) for all A in ℬ. Thus there is no inconsistency
in dropping λ_1 and using λ to denote Lebesgue measure on ℛ^1 as well
as on ℬ.


Example 11.4. The class of bounded rectangles in R^k is a semiring, a fact
needed in the next section. Suppose that A = [x: x_i ∈ I_i, i ≤ k] and B = [x:
x_i ∈ J_i, i ≤ k] are nonempty rectangles, the I_i and J_i being finite intervals. If
A ⊂ B, then I_i ⊂ J_i, so that J_i − I_i is a disjoint union I_i' ∪ I_i'' of intervals
(possibly empty). Consider the 3^k disjoint rectangles [x: x_i ∈ U_i, i ≤ k], where
for each i, U_i is I_i or I_i' or I_i''. One of these rectangles is A itself, and B − A
is the union of the others. The rectangles thus form a semiring. ∎
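The decomposition in Example 11.4 can be tried out numerically. The following is only a sketch under assumptions not in the text: rectangles are represented as lists of half-open sides (left, right], and the helper names split_interval, volume, and difference_pieces are hypothetical.

```python
from fractions import Fraction
from itertools import product

def split_interval(inner, outer):
    """Split outer = (c, d] around inner = (a, b], where c <= a < b <= d.

    Returns the three pieces I, I', I'' as (left, right) pairs standing for
    half-open intervals; I' or I'' is empty when left == right.
    """
    (a, b), (c, d) = inner, outer
    assert c <= a < b <= d, "inner must be a nonempty subinterval of outer"
    return [(a, b), (c, a), (b, d)]

def volume(rect):
    """Ordinary volume of a rectangle given as a list of (left, right] sides."""
    vol = Fraction(1)
    for left, right in rect:
        vol *= Fraction(right) - Fraction(left)
    return vol

def difference_pieces(A, B):
    """Disjoint nonempty rectangles whose union is B - A, for A inside B."""
    choices = [split_interval(i, j) for i, j in zip(A, B)]
    pieces = []
    for combo in product(*choices):      # the 3^k candidate rectangles
        rect = list(combo)
        if rect == list(A):
            continue                     # this candidate is A itself
        if all(left < right for left, right in rect):
            pieces.append(rect)          # discard empty rectangles
    return pieces

A = [(1, 2), (1, 3)]                     # (1, 2] x (1, 3]
B = [(0, 4), (0, 3)]                     # (0, 4] x (0, 3]
pieces = difference_pieces(A, B)
```

Here one of the intervals I_i'' is empty, so only five of the 3^2 − 1 = 8 candidate rectangles survive, and their volumes add up to the volume of B minus that of A.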


An Approximation Theorem 


If 𝒜 is a semiring, then by Theorem 10.3 a measure on σ(𝒜) is determined
by its values on 𝒜 if it is σ-finite there. Theorem 11.4 shows more explicitly
how the measure of a σ(𝒜)-set can be approximated by the measures of
𝒜-sets.


Lemma 1. If A, A_1, ..., A_n are sets in a semiring 𝒜, then there are
disjoint 𝒜-sets C_1, ..., C_m such that

    A ∩ A_1^c ∩ ⋯ ∩ A_n^c = C_1 ∪ ⋯ ∪ C_m.

PROOF. The case n = 1 follows from the definition of semiring applied to
A ∩ A_1^c = A − (A ∩ A_1). If the result holds for n, then A ∩ A_1^c ∩ ⋯ ∩ A_{n+1}^c
= ⋃_{j=1}^m (C_j ∩ A_{n+1}^c); apply the case n = 1 to each set in the union. ∎
Theorem 11.4. Suppose that 𝒜 is a semiring, μ is a measure on ℱ = σ(𝒜),
and μ is σ-finite on 𝒜.

(i) If B ∈ ℱ and ε > 0, there exists a finite or infinite disjoint sequence
A_1, A_2, ... of 𝒜-sets such that B ⊂ ⋃_k A_k and μ((⋃_k A_k) − B) < ε.

(ii) If B ∈ ℱ and ε > 0, and if μ(B) < ∞, then there exists a finite disjoint
sequence A_1, ..., A_n of 𝒜-sets such that μ(B △ (⋃_{k=1}^n A_k)) < ε.


PROOF. Return to the proof of Theorem 11.3. If μ* is the outer measure
defined by (11.4), then ℱ ⊂ ℳ(μ*) and μ* agrees with μ on 𝒜, as was
shown. Since μ* restricted to ℱ is a measure, it follows by Theorem 10.3
that μ* agrees with μ on ℱ as well.

Suppose now that B lies in ℱ and μ(B) = μ*(B) < ∞. There exist 𝒜-sets
A_k such that B ⊂ ⋃_k A_k and μ(⋃_k A_k) ≤ Σ_k μ(A_k) < μ(B) + ε; but then
μ((⋃_k A_k) − B) < ε. To make the sequence {A_k} disjoint, replace A_k by
A_k ∩ A_1^c ∩ ⋯ ∩ A_{k−1}^c; by Lemma 1, each of these sets is a finite disjoint
union of sets in 𝒜.

Next suppose that B lies in ℱ and μ(B) = μ*(B) = ∞. By σ-finiteness
there exist 𝒜-sets C_m such that Ω = ⋃_m C_m and μ(C_m) < ∞. By what has just
been shown, there exist 𝒜-sets A_{mk} such that B ∩ C_m ⊂ ⋃_k A_{mk} and
μ((⋃_k A_{mk}) − (B ∩ C_m)) < ε/2^m. The sets A_{mk} taken all together provide a
sequence A_1, A_2, ... of 𝒜-sets satisfying B ⊂ ⋃_k A_k and μ((⋃_k A_k) − B) < ε.
As before, the A_k can be made disjoint.

To prove part (ii), consider the A_k of part (i). If B has finite measure, so
has A = ⋃_k A_k, and hence by continuity from above (Theorem 10.2(ii)),
μ(A − ⋃_{k≤n} A_k) < ε for some n. But then μ(B △ (⋃_{k=1}^n A_k)) < 2ε. ∎


If, for example, B is a linear Borel set of finite Lebesgue measure, then
λ(B △ (⋃_{k=1}^n A_k)) < ε for some disjoint collection of finite intervals
A_1, ..., A_n.
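A small computation illustrates part (ii) of Theorem 11.4 for Lebesgue measure. The set used here is a hypothetical example, not one from the text: B is the union of the disjoint intervals (2^−n, 2^−n + 4^−n], n = 1, 2, ..., so that λ(B) = 1/3, and the symmetric difference between B and the union of its first n intervals has measure 4^−n/3. The function names are assumptions made for the sketch.

```python
from fractions import Fraction

def interval(n):
    """n-th interval (2^-n, 2^-n + 4^-n] of the set B, for n = 1, 2, ..."""
    left = Fraction(1, 2**n)
    return (left, left + Fraction(1, 4**n))

def tail_measure(n):
    """Lebesgue measure of B minus the union of its first n intervals.

    The omitted lengths 4^-(n+1) + 4^-(n+2) + ... sum to 4^-n / 3.
    """
    return Fraction(1, 3 * 4**n)

def intervals_needed(eps):
    """Smallest n for which the first n intervals approximate B within eps."""
    n = 0
    while tail_measure(n) >= eps:
        n += 1
    return n

eps = Fraction(1, 1000)
n = intervals_needed(eps)
approximation = [interval(k) for k in range(1, n + 1)]
```

Exact rational arithmetic is used so that the comparison with ε involves no rounding.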


Corollary 1. If μ is a finite measure on a σ-field ℱ generated by a field ℱ_0, then
for each ℱ-set A and each positive ε there is an ℱ_0-set B such that μ(A △ B) < ε.

PROOF. This is of course an immediate consequence of part (ii) of the theorem,
but there is a simple direct argument. Let 𝒢 be the class of ℱ-sets with the required
property. Since A^c △ B^c = A △ B, 𝒢 is closed under complementation. If A = ⋃_n A_n,
where A_n ∈ 𝒢, given ε choose n_0 so that μ(A − ⋃_{n≤n_0} A_n) < ε, and then choose
ℱ_0-sets B_n, n ≤ n_0, so that μ(A_n △ B_n) < ε/n_0. Since (⋃_{n≤n_0} A_n) △ (⋃_{n≤n_0} B_n) ⊂
⋃_{n≤n_0} (A_n △ B_n), the ℱ_0-set B = ⋃_{n≤n_0} B_n satisfies μ(A △ B) < 2ε. Of course ℱ_0 ⊂ 𝒢;
since 𝒢 is a σ-field, ℱ ⊂ 𝒢, as required. ∎


Corollary 2. Suppose that 𝒜 is a semiring, Ω is a countable union of 𝒜-sets, and
μ_1, μ_2 are measures on ℱ = σ(𝒜). If μ_1(A) ≤ μ_2(A) < ∞ for A ∈ 𝒜, then μ_1(B) ≤
μ_2(B) for B ∈ ℱ.

PROOF. Since μ_2 is σ-finite on 𝒜, the theorem applies. If μ_2(B) < ∞, choose
disjoint 𝒜-sets A_k such that B ⊂ ⋃_k A_k and Σ_k μ_2(A_k) < μ_2(B) + ε. Then μ_1(B) ≤
Σ_k μ_1(A_k) ≤ Σ_k μ_2(A_k) < μ_2(B) + ε. ∎


A fact used in the next section: 


Lemma 2. Suppose that μ is a nonnegative and finitely additive set function on a
semiring 𝒜, and let A, A_1, ..., A_n be sets in 𝒜.

(i) If ⋃_{i=1}^n A_i ⊂ A and the A_i are disjoint, then Σ_{i=1}^n μ(A_i) ≤ μ(A).
(ii) If A ⊂ ⋃_{i=1}^n A_i, then μ(A) ≤ Σ_{i=1}^n μ(A_i).

PROOF. For part (i), use Lemma 1 to choose disjoint 𝒜-sets C_j such that
A − ⋃_{i=1}^n A_i = ⋃_{j=1}^m C_j. Since μ is finitely additive and nonnegative, it follows that
μ(A) = Σ_{i=1}^n μ(A_i) + Σ_{j=1}^m μ(C_j) ≥ Σ_{i=1}^n μ(A_i).

For (ii), take B_1 = A ∩ A_1 and B_i = A ∩ A_i ∩ A_1^c ∩ ⋯ ∩ A_{i−1}^c for i > 1. By Lemma
1, each B_i is a finite disjoint union of 𝒜-sets C_{ij}. Since the B_i are disjoint,
A = ⋃_i B_i = ⋃_{ij} C_{ij} and ⋃_j C_{ij} ⊂ A_i; it follows by finite additivity and part (i) that
μ(A) = Σ_i Σ_j μ(C_{ij}) ≤ Σ_i μ(A_i). ∎


Compare Theorem 1.3. 


PROBLEMS 


11.1. The proof of Theorem 3.1 obviously applies if the probability measure is 
replaced by a finite measure, since this is only a matter of rescaling. Take as a 
starting point then the fact that a finite measure on a field extends uniquely to 
the generated o-field. By the following steps, prove Theorem 11.2—that is, 
remove the assumption of finiteness. 


(a) Let μ be a measure (not necessarily even σ-finite) on a field ℱ_0, and let
ℱ = σ(ℱ_0). If A is a nonempty set in ℱ_0 and μ(A) < ∞, restrict μ to a finite
measure μ_A on the field ℱ_0 ∩ A, and extend μ_A to a finite measure μ̂_A on the
σ-field ℱ ∩ A generated in A by ℱ_0 ∩ A.

(b) Suppose that E ∈ ℱ. If there exist disjoint ℱ_0-sets A_n such that E ⊂ ⋃_n A_n
and μ(A_n) < ∞, put μ̂(E) = Σ_n μ̂_{A_n}(E ∩ A_n) and prove consistency. Otherwise
put μ̂(E) = ∞.

(c) Show that μ̂ is a measure on ℱ and agrees with μ on ℱ_0.


11.2. Suppose that μ is a nonnegative and finitely additive set function on a
semiring 𝒜.

(a) Use Lemmas 1 and 2, without reference to Theorem 11.3, to show that μ is
countably subadditive if and only if it is countably additive.

(b) Find an example where μ is not countably subadditive.

11.3. Show that Theorem 11.4(ii) can fail if μ(B) = ∞.


11.4. This and Problems 11.5, 16.12, 17.12, 17.13, and 17.14 lead to proofs of the
Daniell-Stone and Riesz representation theorems.

Let Λ be a real linear functional on a vector lattice ℒ of (finite) real
functions on a space Ω. This means that if f and g lie in ℒ, then so do f ∨ g
and f ∧ g (with values max{f(ω), g(ω)} and min{f(ω), g(ω)}), as well as αf + βg,
and Λ(αf + βg) = αΛ(f) + βΛ(g). Assume further of ℒ that f ∈ ℒ implies
f ∧ 1 ∈ ℒ (where 1 denotes the function identically equal to 1). Assume further
of Λ that it is positive in the sense that f ≥ 0 (pointwise) implies Λ(f) ≥ 0 and
continuous from above at 0 in the sense that f_n ↓ 0 (pointwise) implies Λ(f_n) → 0.

(a) If f ≤ g (f, g ∈ ℒ), define in Ω × R^1 an "interval"

(11.5)    (f, g] = [(ω, t): f(ω) < t ≤ g(ω)].

Show that these sets form a semiring 𝒜_0.

(b) Define a set function ν_0 on 𝒜_0 by

(11.6)    ν_0(f, g] = Λ(g − f).

Show that ν_0 is finitely additive and countably subadditive on 𝒜_0.


11.5. ↑ (a) Assume f ∈ ℒ and let f_n = (n(f − f ∧ 1)) ∧ 1. Show that
f(ω) ≤ 1 implies f_n(ω) = 0 for all n and f(ω) > 1 implies f_n(ω) = 1 for all
sufficiently large n. Conclude that for x > 0,

(11.7)    (0, x f_n] ↑ [ω: f(ω) > 1] × (0, x].

(b) Let ℱ be the smallest σ-field with respect to which every f in ℒ is
measurable: ℱ = σ[f^{−1}H: f ∈ ℒ, H ∈ ℛ^1]. Let ℱ_0 be the class of A in ℱ for
which A × (0, 1] ∈ σ(𝒜_0). Show that ℱ_0 is a semiring and that ℱ = σ(ℱ_0).

(c) Let ν be the extension of ν_0 (see (11.6)) to σ(𝒜_0), and for A ∈ ℱ_0 define
μ_0(A) = ν(A × (0, 1]). Show that μ_0 is finitely additive and countably subadditive
on the semiring ℱ_0.
Lebesgue Measure

In Example 11.3 Lebesgue measure λ was constructed on the class ℛ^1 of
linear Borel sets. By Theorem 10.3, λ is the only measure on ℛ^1 satisfying
λ(a, b] = b − a for all intervals. There is in k-space an analogous
k-dimensional Lebesgue measure λ_k on the class ℛ^k of k-dimensional Borel
sets (Example 10.1). It is specified by the requirement that bounded rectangles
have measure

(12.1)    λ_k[x: a_i < x_i ≤ b_i, i = 1, ..., k] = ∏_{i=1}^k (b_i − a_i).

This is ordinary volume: length (k = 1), area (k = 2), volume (k = 3),
or hypervolume (k ≥ 4).
Since an intersection of rectangles is again a rectangle, the uniqueness
theorem shows that (12.1) completely determines λ_k. That there does exist
such a measure on ℛ^k can be proved in several ways. One is to use the ideas
involved in the case k = 1. A second construction is given in Theorem 12.5. A
third, independent, construction uses the general theory of product measures;
this is carried out in Section 18.† For the moment, assume the
existence on ℛ^k of a measure λ_k satisfying (12.1). Of course, λ_k is σ-finite.

A basic property of λ_k is translation invariance.‡

Theorem 12.1. If A ∈ ℛ^k, then A + x = [a + x: a ∈ A] ∈ ℛ^k and
λ_k(A) = λ_k(A + x) for all x.


PROOF. If 𝒢 is the class of A such that A + x is in ℛ^k for all x, then 𝒢
is a σ-field containing the bounded rectangles, and so 𝒢 ⊃ ℛ^k. Thus A + x ∈
ℛ^k for A ∈ ℛ^k.

For fixed x define a measure μ on ℛ^k by μ(A) = λ_k(A + x). Then μ and
λ_k agree on the π-system of bounded rectangles and so agree for all Borel
sets. ∎


If A is a (k — 1)-dimensional subspace and x lies outside A, the hyper- 
planes A + tx for real t are disjoint, and by Theorem 12.1, all have the same 
measure. Since only countably many disjoint sets can have positive measure 
(Theorem 10.2(iv)), the measure common to the A + tx must be 0. Every 
(k — 1)-dimensional hyperplane has k-dimensional Lebesgue measure 0. 

The Lebesgue measure of a rectangle is its ordinary volume. The following 
theorem makes it possible to calculate the measures of simple figures. 


Theorem 12.2. If T: R^k → R^k is linear and nonsingular, then A ∈ ℛ^k
implies that TA ∈ ℛ^k and

(12.2)    λ_k(TA) = |det T| · λ_k(A).


Since a parallelepiped is the image of a rectangle under a linear
transformation, (12.2) can be used to compute its volume. If T is a rotation or a
reflection (an orthogonal or a unitary transformation), then det T = ±1,
and so λ_k(TA) = λ_k(A). Hence every rigid transformation or isometry (an
orthogonal transformation followed by a translation) preserves Lebesgue
measure. An affine transformation has the form Fx = Tx + x_0 (the general
linear transformation T followed by a translation); it is nonsingular if T is. It
follows by Theorems 12.1 and 12.2 that λ_k(FA) = |det T| · λ_k(A) in the
nonsingular case.

†See also Problems 17.14 and 20.4.
‡An analogous fact was used in the construction of a nonmeasurable set on p. 45.


PROOF OF THE THEOREM. Since T ⋃_n A_n = ⋃_n TA_n and TA^c = (TA)^c
because of the assumed nonsingularity of T, the class 𝒢 = [A: TA ∈ ℛ^k] is a
σ-field. Since TA is open for open A, it follows again by the assumed
nonsingularity of T that 𝒢 contains all the open sets and hence (Example
10.1) all the Borel sets. Therefore, A ∈ ℛ^k implies TA ∈ ℛ^k.

For A ∈ ℛ^k, set μ_1(A) = λ_k(TA) and μ_2(A) = |det T| · λ_k(A). Then μ_1
and μ_2 are measures, and by Theorem 10.3 they will agree on ℛ^k (which is
the assertion (12.2)) if they agree on the π-system consisting of the rectangles
[x: a_i < x_i ≤ b_i, i = 1, ..., k] for which the a_i and the b_i are all rational
(Example 10.1). It suffices therefore to prove (12.2) for rectangles with sides
of rational length. Since such a rectangle is a finite disjoint union of cubes
and λ_k is translation-invariant, it is enough to check (12.2) for cubes

(12.3)    A = [x: 0 < x_i ≤ c, i = 1, ..., k]

that have their lower corner at the origin.
Now the general T can by elementary row and column operations† be
represented as a product of linear transformations of these three special
forms:

(1°) T(x_1, ..., x_k) = (x_{π1}, ..., x_{πk}), where π is a permutation of the set
{1, 2, ..., k};

(2°) T(x_1, ..., x_k) = (αx_1, x_2, ..., x_k);

(3°) T(x_1, ..., x_k) = (x_1 + x_2, x_2, ..., x_k).

Because of the rule for multiplying determinants, it suffices to check (12.2)
for T of these three forms. And, as observed, for each such T it suffices to
consider cubes (12.3).

(1°): Such a T is a permutation matrix, and so det T = ±1. Since (12.3) is
invariant under T, (12.2) is in this case obvious.

(2°): Here det T = α, and TA = [x: x_1 ∈ H, 0 < x_i ≤ c, i = 2, ..., k], where
H = (0, αc] if α > 0, H = {0} if α = 0 (although α cannot in fact be 0 if T is
nonsingular), and H = [αc, 0) if α < 0. In each case, λ_k(TA) = |α| · c^k = |α| ·
λ_k(A).

†BIRKHOFF & MAC LANE, Section 8.9.
(3°): Here det T = 1. Let B = [x: 0 < x_i ≤ c, i = 3, ..., k], where B = R^k if
k < 3, and define

    B_1 = [x: 0 < x_1 ≤ x_2 ≤ c] ∩ B,
    B_2 = [x: 0 < x_2 < x_1 ≤ c] ∩ B,
    B_3 = [x: c < x_1 ≤ c + x_2, 0 < x_2 ≤ c] ∩ B.

Then A = B_1 ∪ B_2, TA = B_2 ∪ B_3, and B_1 + (c, 0, ..., 0) = B_3. Since λ_k(B_1) =
λ_k(B_3) by translation invariance, (12.2) follows by additivity. ∎
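As a numerical illustration of (12.2) in the plane, the image of a rectangle under a nonsingular T is a parallelogram, whose area can be computed independently by the shoelace formula and compared with |det T| times the area of the rectangle. This is a sketch only: the matrix, the rectangle, and the function names are hypothetical, and the boundary of the half-open rectangle is ignored because it has measure 0.

```python
def apply(T, point):
    """Apply the 2x2 matrix T (given as a pair of rows) to a point of the plane."""
    (a, b), (c, d) = T
    x, y = point
    return (a * x + b * y, c * x + d * y)

def det(T):
    """Determinant of the 2x2 matrix T."""
    (a, b), (c, d) = T
    return a * d - b * c

def polygon_area(vertices):
    """Area of a simple polygon, vertices listed in order, by the shoelace formula."""
    twice_area = 0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2

rectangle = [(0, 0), (3, 0), (3, 2), (0, 2)]   # corners of (0, 3] x (0, 2]
T = ((2, 1), (-1, 4))                          # det T = 9
shear = ((1, 1), (0, 1))                       # a map of the form (3°), det = 1
image = [apply(T, v) for v in rectangle]
sheared = [apply(shear, v) for v in rectangle]
```

The shear is included because it is the one case in the proof where preservation of area is not evident at a glance.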


If T is singular, then det T = 0 and TA lies in a (k − 1)-dimensional subspace.
Since such a subspace has measure 0, (12.2) holds if A and TA lie in ℛ^k. The
surprising thing is that A ∈ ℛ^k need not imply that TA ∈ ℛ^k if T is singular. Even
for a very simple transformation such as the projection T(x_1, x_2) = (x_1, 0) in the
plane, there exist Borel sets A for which TA is not a Borel set.†

†See HAUSDORFF, p. 241.


Regularity


Important among measures on ℛ^k are those assigning finite measure to
bounded sets. They share with λ_k the property of regularity:

Theorem 12.3. Suppose that μ is a measure on ℛ^k such that μ(A) < ∞ if
A is bounded.

(i) For A ∈ ℛ^k and ε > 0, there exist a closed C and an open G such that
C ⊂ A ⊂ G and μ(G − C) < ε.

(ii) If μ(A) < ∞, then μ(A) = sup μ(K), the supremum extending over the
compact subsets K of A.

PROOF. The second part of the theorem follows from the first: μ(A) < ∞
implies that μ(A − A_0) < ε for a bounded subset A_0 of A, and it then
follows from the first part that μ(A_0 − K) < ε for a closed and hence
compact subset K of A_0.
To prove (i) consider first a bounded rectangle A = [x: a_i < x_i ≤ b_i, i ≤ k].
The set G_n = [x: a_i < x_i < b_i + n^{−1}, i ≤ k] is open and G_n ↓ A. Since μ(G_n) is
finite by hypothesis, it follows by continuity from above that μ(G_n − A) < ε
for large n. A bounded rectangle can therefore be approximated from the
outside by open sets.

The rectangles form a semiring (Example 11.4). For an arbitrary set A in
ℛ^k, by Theorem 11.4(i) there exist bounded rectangles A_k such that A ⊂
⋃_k A_k and μ((⋃_k A_k) − A) < ε. Choose open sets G_k such that A_k ⊂ G_k
and μ(G_k − A_k) < ε/2^k. Then G = ⋃_k G_k is open and μ(G − A) < 2ε. Thus
the general k-dimensional Borel set can be approximated from the outside by
open sets. To approximate from the inside by closed sets, pass to
complements. ∎


Specifying Measures on the Line 


There are on the line many measures other than λ that are important for
probability theory. There is a useful way to describe the collection of all
measures on ℛ^1 that assign finite measure to each bounded set.

If μ is such a measure, define a real function F by

(12.4)    F(x) = μ(0, x]     if x ≥ 0,
          F(x) = −μ(x, 0]    if x ≤ 0.

It is because μ(A) < ∞ for bounded A that F is a finite function. Clearly, F
is nondecreasing. Suppose that x_n ↓ x. If x ≥ 0, apply part (ii) of Theorem
10.2, and if x < 0, apply part (i); in either case, F(x_n) ↓ F(x) follows. Thus F
is continuous from the right. Finally,

(12.5)    μ(a, b] = F(b) − F(a)

for every bounded interval (a, b]. If μ is Lebesgue measure, then (12.4) gives
F(x) = x.

The finite intervals form a π-system generating ℛ^1, and therefore by
Theorem 10.3 the function F completely determines μ through the relation
(12.5). But (12.5) and μ do not determine F: if F(x) satisfies (12.5), then so
does F(x) + c. On the other hand, for a given μ, (12.5) certainly determines
F to within such an additive constant.

For finite μ, it is customary to standardize F by defining it not by (12.4)
but by

(12.6)    F(x) = μ(−∞, x];

then lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = μ(R^1). If μ is a probability
measure, F is called a distribution function (the adjective cumulative is
sometimes added).
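The relation (12.5) is easy to experiment with. The sketch below assumes a hypothetical probability measure, not one taken from the text: mass 1/2 at the point 0 and mass 1/2 spread uniformly over (0, 1]. Its distribution function (12.6) jumps at 0 and is continuous from the right there.

```python
from fractions import Fraction

def F(x):
    """Distribution function of the assumed mixture: an atom of mass 1/2
    at 0 plus mass 1/2 uniform on (0, 1]."""
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    if x < 1:
        return Fraction(1, 2) + x / 2
    return Fraction(1)

def mu(a, b, dist=F):
    """Measure of the interval (a, b] obtained from a function by (12.5)."""
    return dist(b) - dist(a)

def F_shifted(x):
    """F plus a constant; it satisfies (12.5) for the same measure."""
    return F(x) + 7

atom_at_zero = mu(-1, 0)               # (-1, 0] contains the atom
half_of_uniform = mu(0, Fraction(1, 2))
```

Because intervals are open on the left and closed on the right, the atom at 0 is counted in (−1, 0] and not in (0, 1/2], which is exactly what right continuity of F encodes.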
Measures μ are often specified by means of the function F. The following
theorem ensures that to each F there does exist a μ.

Theorem 12.4. If F is a nondecreasing, right-continuous real function on
the line, there exists on ℛ^1 a unique measure μ satisfying (12.5) for all a
and b.

As noted above, uniqueness is a simple consequence of Theorem 10.3. The
proof of existence is almost the same as the construction of Lebesgue
measure, the case F(x) = x. This proof is not carried through at this point,
because it is contained in a parallel, more general construction for
k-dimensional space in the next theorem. For a very simple argument
establishing Theorem 12.4, see the second proof of Theorem 14.1.


Specifying Measures in R^k

The σ-field ℛ^k of k-dimensional Borel sets is generated by the class of
bounded rectangles

(12.7)    A = [x: a_i < x_i ≤ b_i, i = 1, ..., k]

(Example 10.1). If I_i = (a_i, b_i], A has the form of a Cartesian product

(12.8)    A = I_1 × ⋯ × I_k.

Consider the sets of the special form

(12.9)    S_x = [y: y_i ≤ x_i, i = 1, ..., k];

S_x consists of the points "southwest" of x = (x_1, ..., x_k); in the case k = 1 it
is the half-infinite interval (−∞, x]. Now S_x is closed, and (12.7) has the form

(12.10)    A = S_{(b_1, ..., b_k)} − [S_{(a_1, b_2, ..., b_k)} ∪ S_{(b_1, a_2, ..., b_k)} ∪ ⋯ ∪ S_{(b_1, b_2, ..., a_k)}].

Therefore, the class of sets (12.9) generates ℛ^k. This class is a π-system.

The objective is to find a version of Theorem 12.4 for k-space. This will in
particular give k-dimensional Lebesgue measure. The first problem is to find
the analogue of (12.5).

A bounded rectangle (12.7) has 2^k vertices: the points x = (x_1, ..., x_k)
for which each x_i is either a_i or b_i. Let sgn_A x, the signum of the vertex, be
+1 or −1, according as the number of i (1 ≤ i ≤ k) satisfying x_i = a_i is even
or odd. For a real function F on R^k, the difference of F around the vertices
of A is Δ_A F = Σ sgn_A x · F(x), the sum extending over the 2^k vertices x of
A. In the case k = 1, A = (a, b] and Δ_A F = F(b) − F(a). In the case k = 2,
Δ_A F = F(b_1, b_2) − F(b_1, a_2) − F(a_1, b_2) + F(a_1, a_2).
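The difference of F around the vertices of a rectangle can be computed directly from its definition. The following sketch assumes that a rectangle is given by its lower corner a and upper corner b and that F accepts a sequence of k coordinates; the names delta and product_of_coordinates are hypothetical.

```python
from itertools import product

def delta(F, a, b):
    """Difference of F around the vertices of the rectangle with corners a, b.

    Each vertex takes a_i or b_i in coordinate i; its sign is -1 when an odd
    number of coordinates are taken from the lower corner a, and +1 otherwise.
    """
    k = len(a)
    total = 0
    for picks in product((0, 1), repeat=k):      # 0 selects a_i, 1 selects b_i
        vertex = [b[i] if picks[i] else a[i] for i in range(k)]
        sign = -1 if picks.count(0) % 2 else 1
        total += sign * F(vertex)
    return total

def product_of_coordinates(x):
    """F(x) = x_1 * ... * x_k."""
    result = 1
    for coordinate in x:
        result *= coordinate
    return result

def G(x):
    """An arbitrary function of two variables, used to check the case k = 2."""
    return x[0] ** 2 * x[1] + x[1]

volume = delta(product_of_coordinates, (1, 2, 0), (4, 3, 5))
```

For the product of the coordinates, the difference around the vertices is the product of the side lengths, here 3 · 1 · 5.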
Since the k-dimensional analogue of (12.4) is complicated, suppose at first
that μ is a finite measure on ℛ^k and consider instead the analogue of (12.6),
namely

(12.11)    F(x) = μ[y: y_i ≤ x_i, i = 1, ..., k].

Suppose that S_x is defined by (12.9) and A is a bounded rectangle (12.7).
Then

(12.12)    μ(A) = Δ_A F.

To see this, apply to the union on the right in (12.10) the inclusion-exclusion
formula (10.5). The k sets in the union give 2^k − 1 intersections, and these
are the sets S_x for x ranging over the vertices of A other than (b_1, ..., b_k).
Taking into account the signs in (10.5) leads to (12.12).

Suppose x^{(n)} ↓ x in the sense that x_i^{(n)} ↓ x_i as n → ∞ for each i = 1, ..., k.
Then S_{x^{(n)}} ↓ S_x, and hence F(x^{(n)}) → F(x) by Theorem 10.2(ii). In this sense,
F is continuous from above.


Theorem 12.5. Suppose that the real function F on R^k is continuous from
above and satisfies Δ_A F ≥ 0 for bounded rectangles A. Then there exists a
unique measure μ on ℛ^k satisfying (12.12) for bounded rectangles A.

The empty set can be taken as a bounded rectangle (12.7) for which a_i = b_i
for some i, and for such a set A, Δ_A F = 0. Thus (12.12) defines a finite-valued
set function μ on the class of bounded rectangles. The point of the
theorem is that μ extends uniquely to a measure on ℛ^k. The uniqueness is
an immediate consequence of Theorem 10.3, since the bounded rectangles
form a π-system generating ℛ^k.

If F is bounded, then μ will be a finite measure. But the theorem does not
require that F be bounded. The most important unbounded F is F(x) = x_1
⋯ x_k. Here Δ_A F = (b_1 − a_1) ⋯ (b_k − a_k) for A given by (12.7). This is the
ordinary volume of A as specified by (12.1). The corresponding measure
extended to ℛ^k is k-dimensional Lebesgue measure as described at the
beginning of this section.
PROOF OF THEOREM 12.5. As already observed, the uniqueness of the extension
is easy to prove. To prove its existence it will first be shown that μ as defined by
(12.12) is finitely additive on the class of bounded rectangles. Suppose that each side
I_i = (a_i, b_i] of a bounded rectangle (12.7) is partitioned into n_i subintervals J_{ij} =
(t_{i,j−1}, t_{ij}], j = 1, ..., n_i, where a_i = t_{i0} < t_{i1} < ⋯ < t_{in_i} = b_i. The n_1 n_2 ⋯ n_k
rectangles

(12.13)    B_{j_1 ⋯ j_k} = J_{1j_1} × ⋯ × J_{kj_k},    1 ≤ j_1 ≤ n_1, ..., 1 ≤ j_k ≤ n_k,

then partition A. Call such a partition regular. It will first be shown that μ adds for
regular partitions:

(12.14)    μ(A) = Σ_{j_1 ⋯ j_k} μ(B_{j_1 ⋯ j_k}).

The right side of (12.14) is Σ_B Σ_x sgn_B x · F(x), where the outer sum extends over
the rectangles B of the form (12.13) and the inner sum extends over the vertices x of
B. Now

(12.15)    Σ_B Σ_x sgn_B x · F(x) = Σ_x F(x) Σ_B sgn_B x,

where on the right the outer sum extends over each x that is a vertex of one or more
of the B's, and for fixed x the inner sum extends over the B's of which it is a vertex.
Suppose that x is a vertex of one or more of the B's but is not a vertex of A. Then
there must be an i such that x_i is neither a_i nor b_i. There may be several such i, but
fix on one of them and suppose for notational convenience that it is i = 1. Then
x_1 = t_{1j} with 0 < j < n_1. The rectangles (12.13) of which x is a vertex therefore come
in pairs B' = B_{j j_2 ⋯ j_k} and B'' = B_{j+1, j_2 ⋯ j_k}, and sgn_{B'} x = −sgn_{B''} x. Thus the inner
sum on the right in (12.15) is 0 if x is not a vertex of A.

On the other hand, if x is a vertex of A as well as of at least one B, then for each
i either x_i = a_i = t_{i0} or x_i = b_i = t_{in_i}. In this case x is a vertex of only one B of the
form (12.13), the one for which j_i = 1 or j_i = n_i, according as x_i = a_i or x_i = b_i, and
sgn_B x = sgn_A x. Thus the right side of (12.15) reduces to Δ_A F, which proves (12.14).
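The cancellation behind (12.14) does not depend on any property of F, so it can be checked numerically for an arbitrary function. The sketch below uses a hypothetical grid and test function; the helpers delta and regular_partition are assumed names, with delta computing the difference around the vertices as defined before (12.11).

```python
from itertools import product

def delta(F, a, b):
    """Difference of F around the vertices of the rectangle with corners a, b."""
    k = len(a)
    total = 0
    for picks in product((0, 1), repeat=k):      # 0 selects a_i, 1 selects b_i
        vertex = [b[i] if picks[i] else a[i] for i in range(k)]
        sign = -1 if picks.count(0) % 2 else 1
        total += sign * F(vertex)
    return total

def regular_partition(grid):
    """Yield the corners of the rectangles B_{j_1...j_k} cut out by the grid.

    grid[i] lists the points t_{i0} < t_{i1} < ... < t_{i n_i} for side i.
    """
    sides = [list(zip(t[:-1], t[1:])) for t in grid]
    for combo in product(*sides):
        lower = tuple(side[0] for side in combo)
        upper = tuple(side[1] for side in combo)
        yield lower, upper

def F(x):
    """An arbitrary test function; (12.14) is an identity for every F."""
    return x[0] ** 2 * x[1] + 3 * x[0] - x[1] ** 3

grid = [[0, 1, 3, 4], [-2, 0, 5]]        # three subintervals by two subintervals
a = tuple(t[0] for t in grid)
b = tuple(t[-1] for t in grid)
total = sum(delta(F, lo, hi) for lo, hi in regular_partition(grid))
```

The sum over the six small rectangles agrees with the difference around the four vertices of A, since the contributions at every interior grid point cancel in pairs.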

Now suppose that A = ⋃_{u=1}^n A_u, where A is the bounded rectangle (12.8),
A_u = I_{1u} × ⋯ × I_{ku} for u = 1, ..., n, and the A_u are disjoint. For each i (1 ≤ i ≤ k),
the intervals I_{i1}, ..., I_{in} have I_i as their union, although they need not be disjoint. But
their endpoints split I_i into disjoint subintervals J_{i1}, ..., J_{in_i} such that each I_{iu} is the
union of certain of the J_{ij}. The rectangles B of the form (12.13) are a regular
partition of A, as before; furthermore, the B's contained in a single A_u form a
regular partition of A_u. Since the A_u are disjoint, it follows by (12.14) that

    μ(A) = Σ_B μ(B) = Σ_{u=1}^n Σ_{B ⊂ A_u} μ(B) = Σ_{u=1}^n μ(A_u).

Therefore, μ is finitely additive on the class ℐ^k of bounded k-dimensional
rectangles.

As shown in Example 11.4, ℐ^k is a semiring, and so Theorem 11.3 applies. If
A, A_1, ..., A_n are sets in ℐ^k, then by Lemma 2 of the preceding section,

(12.16)    μ(A) ≤ Σ_{u=1}^n μ(A_u)    if A ⊂ ⋃_{u=1}^n A_u.

To apply Theorem 11.3 requires showing that μ is countably subadditive on ℐ^k.
Suppose then that A ⊂ ⋃_{u=1}^∞ A_u, where A and the A_u are in ℐ^k. The problem is to
prove that

(12.17)    μ(A) ≤ Σ_{u=1}^∞ μ(A_u).

Suppose that ε > 0. If A is given by (12.7) and B = [x: a_i + δ < x_i ≤ b_i, i ≤ k], then
μ(B) > μ(A) − ε for small enough positive δ, because μ is defined by (12.12) and F is
continuous from above. Note that A contains the closure B^− = [x: a_i + δ ≤ x_i ≤ b_i,
i ≤ k] of B. Similarly, for each u there is in ℐ^k a set B_u = [x: a_{iu} < x_i ≤ b_{iu} + δ_u,
i ≤ k] such that μ(B_u) < μ(A_u) + ε/2^u and A_u is in the interior B_u° = [x: a_{iu} < x_i <
b_{iu} + δ_u, i ≤ k] of B_u.

Since B^− ⊂ A ⊂ ⋃_{u=1}^∞ A_u ⊂ ⋃_{u=1}^∞ B_u°, it follows by the Heine-Borel theorem that
B ⊂ B^− ⊂ ⋃_{u=1}^n B_u° ⊂ ⋃_{u=1}^n B_u for some n. Now (12.16) applies, and so μ(A) − ε <
μ(B) ≤ Σ_{u=1}^n μ(B_u) < Σ_{u=1}^∞ μ(A_u) + ε. Since ε was arbitrary, the proof of (12.17) is
complete.

Thus μ as defined by (12.12) is finitely additive and countably subadditive on the
semiring ℐ^k. By Theorem 11.3, μ extends to a measure on ℛ^k = σ(ℐ^k). ∎


Strange Euclidean Sets* 


It is possible to construct in the plane a simple curve (the image of [0, 1] under a
continuous, one-to-one mapping) having positive area. This is surprising because the
curve is simple: if the continuous map is not required to be one-to-one, the curve can
even fill a square.†

Such constructions are counterintuitive, but nothing like one due to Banach and
Tarski: Two sets in Euclidean space are congruent if each can be carried onto the
other by an isometry, a rigid transformation. Suppose of sets A and B in R^k that A
can be decomposed into sets A_1, ..., A_n and B can be decomposed into sets
B_1, ..., B_n in such a way that A_i and B_i are congruent for each i = 1, ..., n. In this
case A and B are said to be congruent by dissection. If all the pieces A_i and B_i are
Borel sets, then of course λ_k(A) = Σ_{i=1}^n λ_k(A_i) = Σ_{i=1}^n λ_k(B_i) = λ_k(B). But if
nonmeasurable sets are allowed in the dissections, then something astonishing happens: If
k ≥ 3, and if A and B are bounded sets in R^k and have nonempty interiors, then A and B
are congruent by dissection. (The result does not hold if k is 1 or 2.)

This is the Banach-Tarski paradox. It is usually illustrated in 3-space this way: It is
possible to break a solid ball the size of a pea into finitely many pieces and then put
them back together again in such a way as to get a solid ball the size of the sun.‡

*This topic may be omitted.
†A Peano curve: see HAUSDORFF, p. 231. For the construction of simple curves of positive area,
see GELBAUM & OLMSTED, pp. 135 ff.


PROBLEMS 


12.1. Suppose that μ is a measure on ℛ^1 that is finite for bounded sets and is
translation-invariant: μ(A + x) = μ(A). Show that μ(A) = αλ(A) for some
α ≥ 0. Extend to R^k.

12.2. Suppose that A ∈ ℛ^1, λ(A) > 0, and 0 < θ < 1. Show that there is a bounded
open interval I such that λ(A ∩ I) ≥ θλ(I). Hint: Show that λ(A) may be
assumed finite, and choose an open G such that A ⊂ G and λ(A) ≥ θλ(G).
Now G = ⋃_n I_n for disjoint open intervals I_n [A12], and Σ_n λ(A ∩ I_n) ≥
θ Σ_n λ(I_n); use an I_n.

12.3. ↑ If A ∈ ℛ^1 and λ(A) > 0, then the origin is interior to the difference set
D(A) = [x − y: x, y ∈ A]. Hint: Choose a bounded open interval I as in
Problem 12.2 for θ = 3/4. Suppose that |z| < λ(I)/2; since A ∩ I and (A ∩ I) + z
are contained in an interval of length less than 3λ(I)/2 and hence cannot be
disjoint, z ∈ D(A).

12.4. ↑ The following construction leads to a subset H of the unit interval that is
nonmeasurable in the extreme sense that its inner and outer Lebesgue measures
are 0 and 1: λ_*(H) = 0 and λ*(H) = 1 (see (3.9) and (3.10)). Complete
the details. The ideas are those in the construction of a nonmeasurable set at
the end of Section 3. It will be convenient to work in G = [0, 1); let ⊕ and ⊖
denote addition and subtraction modulo 1 in G, which is a group with
identity 0.

(a) Fix an irrational θ in G and for n = 0, ±1, ±2, ... let θ_n be nθ reduced
modulo 1. Show that θ_n ⊕ θ_m = θ_{n+m}, θ_n ⊖ θ_m = θ_{n−m}, and the θ_n are distinct.
Show that {θ_{2n}: n = 0, ±1, ...} and {θ_{2n+1}: n = 0, ±1, ...} are dense in G.

(b) Take x and y to be equivalent if x ⊖ y lies in {θ_n: n = 0, ±1, ...}, which is
a subgroup. Let S contain one representative from each equivalence class
(each coset). Show that G = ⋃_n (S ⊕ θ_n), where the union is disjoint. Put
H = ⋃_n (S ⊕ θ_{2n}) and show that G − H = H ⊕ θ.

(c) Suppose that A is a Borel set contained in H. If λ(A) > 0, then D(A)
contains an interval (0, ε); but then some θ_{2k+1} lies in (0, ε) ⊂ D(A) ⊂ D(H),
and so θ_{2k+1} = h_1 − h_2 = h_1 ⊖ h_2 = (s_1 ⊕ θ_{2n_1}) ⊖ (s_2 ⊕ θ_{2n_2}) for some h_1, h_2 in
H and some s_1, s_2 in S. Deduce that s_1 = s_2 and obtain a contradiction.
Conclude that λ_*(H) = 0.

(d) Show that λ_*(H ⊕ θ) = 0 and λ*(H) = 1.


‡See WAGON for an account of these prodigies.
12.5. ↑ The construction here gives sets H_n such that H_n ↑ G and λ_*(H_n) = 0. If
J_n = G − H_n, then J_n ↓ ∅ and λ*(J_n) = 1.

(a) Let H_n = ⋃_{k=−n}^n (S ⊕ θ_k), so that H_n ↑ G. Show that the sets H_n ⊕ θ_{(2n+1)v}
are disjoint for different v.

(b) Suppose that A is a Borel set contained in H_n. Show that A and indeed
all the A ⊕ θ_{(2n+1)v} have Lebesgue measure 0.

12.6. Suppose that μ is nonnegative and finitely additive on ℛ^k and that μ(R^k) < ∞.
Suppose further that μ(A) = sup μ(K), where K ranges over the compact
subsets of A. Show that μ is countably additive. (Compare Theorem 12.3(ii).)

12.7. Suppose μ is a measure on ℛ^k such that bounded sets have finite measure.
Given A, show that there exist an F_σ-set U (a countable union of closed sets)
and a G_δ-set V (a countable intersection of open sets) such that U ⊂ A ⊂ V and
μ(V − U) = 0.

12.8. 2.19 ↑ Suppose that μ is a nonatomic probability measure on (R^k, ℛ^k) and
that μ(A) > 0. Show that there is an uncountable compact set K such that
K ⊂ A and μ(K) = 0.

12.9. The minimal closed support of a measure μ on ℛ^k is a closed set C_μ such that
C_μ ⊂ C for closed C if and only if C supports μ. Prove its existence and
uniqueness. Characterize the points of C_μ as those x such that μ(U) > 0 for
every neighborhood U of x. If k = 1 and if μ and the function F(x) are
related by (12.5), the condition is F(x − ε) < F(x + ε) for all ε; x is in this case
called a point of increase of F.

12.10. Of minor interest is the k-dimensional analogue of (12.4). Let I_t be (0, t] for
t ≥ 0 and (t, 0] for t ≤ 0, and let A_x = I_{x_1} × ⋯ × I_{x_k}. Let φ(x) be +1 or −1
according as the number of i, 1 ≤ i ≤ k, for which x_i < 0 is even or odd. Show
that, if F(x) = φ(x)μ(A_x), then (12.12) holds for bounded rectangles A.

Call F degenerate if it is a function of some k − 1 of the coordinates, the
requirement in the case k = 1 being that F is constant. Show that Δ_A F = 0 for
every bounded rectangle if and only if F is a finite sum of degenerate
functions; (12.12) determines F to within addition of a function of this sort.

12.11. Let G be a nondecreasing, right-continuous function on the line, and put
F(x, y) = min{G(x), y}. Show that F satisfies the conditions of Theorem 12.5,
that the curve C = [(x, G(x)): x ∈ R^1] supports the corresponding measure,
and that λ_2(C) = 0.

12.12. Let F_1 and F_2 be nondecreasing, right-continuous functions on the line, and
put F(x_1, x_2) = F_1(x_1)F_2(x_2). Show that F satisfies the conditions of Theorem
12.5. Let μ, μ_1, μ_2 be the measures corresponding to F, F_1, F_2, and prove that
μ(A_1 × A_2) = μ_1(A_1)μ_2(A_2) for intervals A_1 and A_2. This μ is the product of
μ_1 and μ_2; products are studied in a general setting in Section 18.
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If a real function X on © has finite range, it is by the definition in Section 5 
a simple random variable if [w: X(w) =x] lies in the basic o-field Z for 
each x. The requirement appropriate for the general real function X is 
stronger; namely, [w: X(w) € H] must lie in Z for each linear Borel set H. 
An abstract version of this definition greatly simplifies the theory of such 
functions. 


Measurable Mappings 


Let $(\Omega, \mathscr{F})$ and $(\Omega', \mathscr{F}')$ be two measurable spaces. For a mapping $T\colon
\Omega \to \Omega'$, consider the inverse images $T^{-1}A' = [\omega \in \Omega\colon T\omega \in A']$ for $A' \subset \Omega'$.
(See [A7] for the properties of inverse images.) The mapping $T$ is measurable
$\mathscr{F}/\mathscr{F}'$ if $T^{-1}A' \in \mathscr{F}$ for each $A' \in \mathscr{F}'$.

For a real function $f$, the image space $\Omega'$ is the line $R^1$, and in this case
$\mathscr{R}^1$ is always tacitly understood to play the role of $\mathscr{F}'$. A real function $f$ on
$\Omega$ is thus measurable $\mathscr{F}$ (or simply measurable, if it is clear from the context
what $\mathscr{F}$ is involved) if it is measurable $\mathscr{F}/\mathscr{R}^1$, that is, if $f^{-1}H =
[\omega\colon f(\omega) \in H] \in \mathscr{F}$ for every $H \in \mathscr{R}^1$. In probability contexts, a real measurable
function is called a random variable. The point of the definition is to
ensure that $[\omega\colon f(\omega) \in H]$ has a measure or probability for all sufficiently
regular sets $H$ of real numbers, that is, for all Borel sets $H$.


Example 13.1. A real function $f$ with finite range is measurable if
$f^{-1}\{x\} \in \mathscr{F}$ for each singleton $\{x\}$, but this is too weak a condition to impose
on the general $f$. (It is satisfied if $(\Omega, \mathscr{F}) = (R^1, \mathscr{R}^1)$ and $f$ is any one-to-one
map of the line into itself; but in this case $f^{-1}H$, even for so simple a set $H$
as an interval, can for an appropriately chosen $f$ be any uncountable set, say
the non-Borel set constructed in Section 3.) On the other hand, for a
measurable $f$ with finite range, $f^{-1}H \in \mathscr{F}$ for every $H \subset R^1$; but this is too
strong a condition to impose on the general $f$. (For $(\Omega, \mathscr{F}) = (R^1, \mathscr{R}^1)$, even
$f(x) = x$ fails to satisfy it.) Notice that nothing is required of $fA$; it need not
lie in $\mathscr{R}^1$ for $A$ in $\mathscr{F}$. ■


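If in addition to $(\Omega, \mathscr{F})$, $(\Omega', \mathscr{F}')$, and the map $T\colon \Omega \to \Omega'$, there is a third
measurable space $(\Omega'', \mathscr{F}'')$ and a map $T'\colon \Omega' \to \Omega''$, the composition $T'T =
T' \circ T$ is the mapping $\Omega \to \Omega''$ that carries $\omega$ to $T'(T(\omega))$.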


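Theorem 13.1. (i) If $T^{-1}A' \in \mathscr{F}$ for each $A' \in \mathscr{A}'$ and $\mathscr{A}'$ generates $\mathscr{F}'$,
then $T$ is measurable $\mathscr{F}/\mathscr{F}'$.

(ii) If $T$ is measurable $\mathscr{F}/\mathscr{F}'$ and $T'$ is measurable $\mathscr{F}'/\mathscr{F}''$, then $T'T$ is
measurable $\mathscr{F}/\mathscr{F}''$.

Proof. Since $T^{-1}(\Omega' - A') = \Omega - T^{-1}A'$ and $T^{-1}(\bigcup_n A'_n) = \bigcup_n T^{-1}A'_n$,
and since $\mathscr{F}$ is a $\sigma$-field in $\Omega$, the class $[A'\colon T^{-1}A' \in \mathscr{F}]$ is a $\sigma$-field in $\Omega'$. If
this $\sigma$-field contains $\mathscr{A}'$, it must also contain $\sigma(\mathscr{A}')$, and (i) follows.

As for (ii), it follows by the hypotheses that $A'' \in \mathscr{F}''$ implies that
$(T')^{-1}A'' \in \mathscr{F}'$, which in turn implies that $(T'T)^{-1}A'' = [\omega\colon T'T\omega \in A''] =
[\omega\colon T\omega \in (T')^{-1}A''] = T^{-1}((T')^{-1}A'') \in \mathscr{F}$. ■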


By part (i), if $f$ is a real function such that $[\omega\colon f(\omega) \le x]$ lies in $\mathscr{F}$ for all
$x$, then $f$ is measurable $\mathscr{F}$. This condition is usually easy to check.
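
On a finite space the criterion just stated can be checked exhaustively. The following is a minimal sketch rather than anything from the text: it assumes a six-point space with the $\sigma$-field generated by a three-block partition, and the helper names `sigma_field_from_partition` and `is_measurable` are hypothetical. For a function with finite range it suffices to test the sets $[\omega\colon f(\omega) \le x]$ for $x$ in the range, since every other $x$ gives one of these sets or the empty set.

```python
from itertools import chain, combinations

def sigma_field_from_partition(blocks):
    """All unions of blocks of a finite partition: the sigma-field it generates."""
    blocks = [frozenset(b) for b in blocks]
    field = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            field.add(frozenset(chain.from_iterable(combo)))
    return field

def is_measurable(f, omega, field):
    """Check that [w: f(w) <= x] lies in the field for each x in the range of f."""
    values = sorted({f(w) for w in omega})
    return all(
        frozenset(w for w in omega if f(w) <= x) in field
        for x in values
    )

# Assumed example space: six points, partition into three two-point blocks.
omega = range(6)
field = sigma_field_from_partition([{0, 1}, {2, 3}, {4, 5}])

f = lambda w: w // 2   # constant on each block
g = lambda w: w % 2    # separates the two points of a block
```

Here `f` passes the check and `g` fails it, because $[g \le 0] = \{0, 2, 4\}$ is not a union of blocks.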


Mappings into $R^k$


For a mapping $f\colon \Omega \to R^k$ carrying $\Omega$ into $k$-space, $\mathscr{R}^k$ is always understood
to be the $\sigma$-field in the image space. In probabilistic contexts, a measurable
mapping into $R^k$ is called a random vector. Now $f$ must have the form


(13.1) $f(\omega) = (f_1(\omega), \ldots, f_k(\omega))$


for real functions $f_j(\omega)$. Since the sets (12.9) (the "southwest regions")
generate $\mathscr{R}^k$, Theorem 13.1(i) implies that $f$ is measurable $\mathscr{F}$ if and only if
the set


(13.2) $[\omega\colon f_1(\omega) \le x_1, \ldots, f_k(\omega) \le x_k] = \bigcap_{j=1}^{k} [\omega\colon f_j(\omega) \le x_j]$


lies in $\mathscr{F}$ for each $(x_1, \ldots, x_k)$. This condition holds if each $f_j$ is measurable
$\mathscr{F}$. On the other hand, if $x_j = x$ is fixed and $x_1 = \cdots = x_{j-1} = x_{j+1} = \cdots = x_k
= n$ goes to $\infty$, the sets (13.2) increase to $[\omega\colon f_j(\omega) \le x]$; the condition thus
implies that each $f_j$ is measurable. Therefore, $f$ is measurable $\mathscr{F}$ if and only
if each component function $f_j$ is measurable $\mathscr{F}$. This provides a practical
criterion for mappings into $R^k$.

A mapping $f\colon R^i \to R^k$ is defined to be measurable if it is measurable
$\mathscr{R}^i/\mathscr{R}^k$. Such functions are often called Borel functions. To sum up, $T\colon
\Omega \to \Omega'$ is measurable $\mathscr{F}/\mathscr{F}'$ if $T^{-1}A' \in \mathscr{F}$ for all $A' \in \mathscr{F}'$; $f\colon \Omega \to R^k$ is
measurable $\mathscr{F}$ if it is measurable $\mathscr{F}/\mathscr{R}^k$; and $f\colon R^i \to R^k$ is measurable (a
Borel function) if it is measurable $\mathscr{R}^i/\mathscr{R}^k$. If $H$ lies outside $\mathscr{R}^1$, then $I_H$
($i = k = 1$) is not a Borel function.


Theorem 13.2. If $f\colon R^i \to R^k$ is continuous, then it is measurable.


Proof. As noted above, it suffices to check that each set (13.2) lies in
$\mathscr{R}^i$. But each is closed because of continuity. ■

Theorem 13.3. If $f_j\colon \Omega \to R^1$ is measurable $\mathscr{F}$, $j = 1, \ldots, k$, then
$g(f_1(\omega), \ldots, f_k(\omega))$ is measurable $\mathscr{F}$ if $g\colon R^k \to R^1$ is measurable, and in
particular if it is continuous.


Proof. If the $f_j$ are measurable, then so is (13.1), so that the result
follows by Theorem 13.1(ii). ■


Taking $g(x_1, \ldots, x_k)$ to be $\sum_{i=1}^{k} x_i$, $\prod_{i=1}^{k} x_i$, and $\max\{x_1, \ldots, x_k\}$ in turn
shows that sums, products, and maxima of measurable functions are measurable.
If $f(\omega)$ is real and measurable, then so are $\sin f(\omega)$, $e^{f(\omega)}$, and so on,
and if $f(\omega)$ never vanishes, then $1/f(\omega)$ is measurable as well.


Limits and Measurability 


For a real function $f$ it is often convenient to admit the artificial values $\infty$
and $-\infty$, that is, to work with the extended real line $[-\infty, \infty]$. Such an $f$ is by
definition measurable $\mathscr{F}$ if $[\omega\colon f(\omega) \in H]$ lies in $\mathscr{F}$ for each Borel set $H$ of
(finite) real numbers and if $[\omega\colon f(\omega) = \infty]$ and $[\omega\colon f(\omega) = -\infty]$ both lie in $\mathscr{F}$.
This extension of the notion of measurability is convenient in connection with
limits and suprema, which need not be finite.


Theorem 13.4. Suppose that $f_1, f_2, \ldots$ are real functions measurable $\mathscr{F}$.

(i) The functions $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, and $\liminf_n f_n$ are measurable $\mathscr{F}$.
(ii) If $\lim_n f_n$ exists everywhere, then it is measurable $\mathscr{F}$.
(iii) The $\omega$-set where $\{f_n(\omega)\}$ converges lies in $\mathscr{F}$.
(iv) If $f$ is measurable $\mathscr{F}$, then the $\omega$-set where $f_n(\omega) \to f(\omega)$ lies in $\mathscr{F}$.


Proof. Clearly, $[\sup_n f_n \le x] = \bigcap_n [f_n \le x]$ lies in $\mathscr{F}$ even for $x = \infty$ and
$x = -\infty$, and so $\sup_n f_n$ is measurable. The measurability of $\inf_n f_n$ follows
the same way, and hence $\limsup_n f_n = \inf_n \sup_{k \ge n} f_k$ and $\liminf_n f_n =
\sup_n \inf_{k \ge n} f_k$ are measurable. If $\lim_n f_n$ exists, it coincides with these last
two functions and hence is measurable. Finally, the set in (iii) is the set where
$\limsup_n f_n(\omega) = \liminf_n f_n(\omega)$, and that in (iv) is the set where this common
value is $f(\omega)$. ■


Special cases of this theorem have been encountered before: part (iv), for
example, in connection with the strong law of large numbers. The last three
parts of the theorem obviously carry over to mappings into $R^k$.

A simple real function is one with finite range; it can be put in the form


(13.3) $f = \sum_i x_i I_{A_i}$,

where the $A_i$ form a finite decomposition of $\Omega$. It is measurable $\mathscr{F}$ if each
$A_i$ lies in $\mathscr{F}$. The simple random variables of Section 5 have this form.

Many results concerning measurable functions are most easily proved first 
for simple functions and then, by an appeal to the next theorem and a 
passage to the limit, for the general measurable function. 


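Theorem 13.5. If $f$ is real and measurable $\mathscr{F}$, there exists a sequence $\{f_n\}$
of simple functions, each measurable $\mathscr{F}$, such that


(13.4) $0 \le f_n(\omega) \uparrow f(\omega)$ if $f(\omega) \ge 0$
and
(13.5) $0 \ge f_n(\omega) \downarrow f(\omega)$ if $f(\omega) \le 0$.


Proof. Define


(13.6) $f_n(\omega) = \begin{cases}
-n & \text{if } -\infty \le f(\omega) \le -n, \\
-(k-1)2^{-n} & \text{if } -k2^{-n} < f(\omega) \le -(k-1)2^{-n},\ 1 \le k \le n2^n, \\
(k-1)2^{-n} & \text{if } (k-1)2^{-n} \le f(\omega) < k2^{-n},\ 1 \le k \le n2^n, \\
n & \text{if } n \le f(\omega) \le \infty.
\end{cases}$


This sequence has the required properties. ■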
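
The definition (13.6) is a truncation to $[-n, n]$ followed by rounding toward zero on the dyadic grid of mesh $2^{-n}$. The sketch below evaluates it at a single point where $f$ takes the value `y`; the function name `f_n` and the use of floating-point arithmetic are assumptions made for illustration only.

```python
import math

def f_n(y, n):
    """Value of the n-th simple approximation (13.6) where f takes the value y."""
    if y >= n:                 # covers y = +infinity as well
        return float(n)
    if y <= -n:                # covers y = -infinity as well
        return float(-n)
    step = 2.0 ** (-n)
    if y >= 0:
        # (k-1) 2^-n <= y < k 2^-n  ->  (k-1) 2^-n
        return math.floor(y / step) * step
    # -k 2^-n < y <= -(k-1) 2^-n  ->  -(k-1) 2^-n
    return math.ceil(y / step) * step

# Approximations of pi from below and of -pi from above.
up = [f_n(math.pi, n) for n in range(1, 25)]
down = [f_n(-math.pi, n) for n in range(1, 25)]
```

For a fixed finite `y`, the list `up` is nondecreasing and `down` is nonincreasing, and once $n > |y|$ the error is less than $2^{-n}$, which is the content of (13.4) and (13.5).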


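Note that (13.6) covers the possibilities $f(\omega) = \infty$ and $f(\omega) = -\infty$.
If $A \in \mathscr{F}$, a function $f$ defined only on $A$ is by definition measurable if
$[\omega \in A\colon f(\omega) \in H]$ lies in $\mathscr{F}$ for $H \in \mathscr{R}^1$ and for $H = \{\infty\}$ and $H = \{-\infty\}$.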


Transformations of Measures 


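Let $(\Omega, \mathscr{F})$ and $(\Omega', \mathscr{F}')$ be measurable spaces, and suppose that the
mapping $T\colon \Omega \to \Omega'$ is measurable $\mathscr{F}/\mathscr{F}'$. Given a measure $\mu$ on $\mathscr{F}$, define
a set function $\mu T^{-1}$ on $\mathscr{F}'$ by


(13.7) $\mu T^{-1}(A') = \mu(T^{-1}A'), \qquad A' \in \mathscr{F}'$.


That is, $\mu T^{-1}$ assigns value $\mu(T^{-1}A')$ to the set $A'$. If $A' \in \mathscr{F}'$, then
$T^{-1}A' \in \mathscr{F}$ because $T$ is measurable, and hence the set function $\mu T^{-1}$ is
well defined on $\mathscr{F}'$. Since $T^{-1}\bigcup_n A'_n = \bigcup_n T^{-1}A'_n$ and the $T^{-1}A'_n$ are disjoint
sets in $\Omega$ if the $A'_n$ are disjoint sets in $\Omega'$, the countable additivity of $\mu T^{-1}$
follows from that of $\mu$. Thus $\mu T^{-1}$ is a measure. This way of transferring a
measure from $\Omega$ to $\Omega'$ will prove useful in a number of ways.

If $\mu$ is finite, so is $\mu T^{-1}$; if $\mu$ is a probability measure, so is $\mu T^{-1}$.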
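
For a measure concentrated on finitely many points, (13.7) amounts to adding up the masses of all points that $T$ sends to the same image. The sketch below assumes the measure is stored as a dictionary of point masses; the name `pushforward` and the fair-die example are illustrative choices, not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(mu, T):
    """Return mu T^{-1} for a measure given as point masses {omega: mass}."""
    image = defaultdict(Fraction)          # missing keys start at mass 0
    for w, mass in mu.items():
        image[T(w)] += mass                # mass of w is carried to T(w)
    return dict(image)

# Assumed example: a fair die, and T(w) = parity of w.
die = {w: Fraction(1, 6) for w in range(1, 7)}
parity = pushforward(die, lambda w: w % 2)
```

Total mass is preserved, so the image of a probability measure is again a probability measure.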


PROBLEMS 


13.1. Functions are often defined in pieces (for example, let $f(x)$ be $x^2$ or $x^{-1}$ as
$x \ge 0$ or $x < 0$), and the following result shows that the function is measurable
if the pieces are.

Consider measurable spaces $(\Omega, \mathscr{F})$ and $(\Omega', \mathscr{F}')$ and a map $T\colon \Omega \to \Omega'$.
Let $A_1, A_2, \ldots$ be a countable covering of $\Omega$ by $\mathscr{F}$-sets. Consider the $\sigma$-field
$\mathscr{F}_n = [A\colon A \subset A_n, A \in \mathscr{F}]$ in $A_n$ and the restriction $T_n$ of $T$ to $A_n$. Show that
$T$ is measurable $\mathscr{F}/\mathscr{F}'$ if and only if $T_n$ is measurable $\mathscr{F}_n/\mathscr{F}'$ for each $n$.


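13.2. (a) For a map $T$ and $\sigma$-fields $\mathscr{F}$ and $\mathscr{F}'$, define $T^{-1}\mathscr{F}' = [T^{-1}A'\colon A' \in \mathscr{F}']$
and $T\mathscr{F} = [A'\colon T^{-1}A' \in \mathscr{F}]$. Show that $T^{-1}\mathscr{F}'$ and $T\mathscr{F}$ are $\sigma$-fields and that
measurability $\mathscr{F}/\mathscr{F}'$ is equivalent to $T^{-1}\mathscr{F}' \subset \mathscr{F}$ and to $\mathscr{F}' \subset T\mathscr{F}$.

(b) For given $\mathscr{F}'$, $T^{-1}\mathscr{F}'$, which is the smallest $\sigma$-field for which $T$ is
measurable $\mathscr{F}/\mathscr{F}'$, is by definition the $\sigma$-field generated by $T$. For simple
random variables describe $\sigma(X_1, \ldots, X_n)$ in these terms.

(c) Let $\sigma'(\mathscr{A}')$ be the $\sigma$-field in $\Omega'$ generated by $\mathscr{A}'$. Show that $\sigma(T^{-1}\mathscr{A}') =
T^{-1}(\sigma'(\mathscr{A}'))$. Prove Theorem 10.1 by taking $T$ to be the identity map from $\Omega_0$
to $\Omega$.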


13.3. ↑ Suppose that $f\colon \Omega \to R^1$. Show that $f$ is measurable $T^{-1}\mathscr{F}'$ if and only if
there exists a map $\varphi\colon \Omega' \to R^1$ such that $\varphi$ is measurable $\mathscr{F}'$ and $f = \varphi T$. Hint:
First consider simple functions and then use Theorem 13.5.


13.4. ↑ Relate the result in Problem 13.3 to Theorem 5.1(ii).


13.5. Show of real functions $f$ and $g$ that $f(\omega) + g(\omega) < x$ if and only if there exist
rationals $r$ and $s$ such that $r + s < x$, $f(\omega) < r$, and $g(\omega) < s$. Prove directly
that $f + g$ is measurable $\mathscr{F}$ if $f$ and $g$ are.


13.6. Let $\mathscr{F}$ be a $\sigma$-field in $R^1$. Show that $\mathscr{R}^1 \subset \mathscr{F}$ if and only if every continuous
function is measurable $\mathscr{F}$. Thus $\mathscr{R}^1$ is the smallest $\sigma$-field with respect to
which all the continuous functions are measurable.


13.7. Consider on $R^1$ the smallest class $\mathscr{X}$ (that is, the intersection of all classes) of
real functions containing all the continuous functions and closed under pointwise
passages to the limit. The elements of $\mathscr{X}$ are called Baire functions. Show
that Baire functions and Borel functions on $R^1$ are the same thing.


13.8. A real function $f$ on the line is upper semicontinuous at $x$ if for each $\varepsilon$ there
is a $\delta$ such that $|x - y| < \delta$ implies that $f(y) < f(x) + \varepsilon$. Show that, if $f$ is
everywhere upper semicontinuous, then it is measurable.


†But see Problem 13.14.

13.9. Suppose that $f_n$ and $f$ are finite-valued, $\mathscr{F}$-measurable functions such that
$f_n(\omega) \to f(\omega)$ for $\omega \in A$, where $\mu(A) < \infty$ ($\mu$ a measure on $\mathscr{F}$). Prove Egoroff's
theorem: For each $\varepsilon$ there exists a subset $B$ of $A$ such that $\mu(B) < \varepsilon$ and
$f_n(\omega) \to f(\omega)$ uniformly on $A - B$. Hint: Let $B_n^{(k)}$ be the set of $\omega$ in $A$ such
that $|f(\omega) - f_i(\omega)| > k^{-1}$ for some $i \ge n$. Show that $B_n^{(k)} \downarrow \emptyset$ as $n \uparrow \infty$, choose
$n_k$ so that $\mu(B_{n_k}^{(k)}) < \varepsilon/2^k$, and put $B = \bigcup_{k=1}^{\infty} B_{n_k}^{(k)}$.


13.10. ↑ Show that Egoroff's theorem is false without the hypothesis $\mu(A) < \infty$.


13.11. 2.9↑ Show that, if $f$ is measurable $\sigma(\mathscr{A})$, then there exists a countable
subclass $\mathscr{A}_0$ of $\mathscr{A}$ such that $f$ is measurable $\sigma(\mathscr{A}_0)$.


13.12. Circular Lebesgue measure. Let $C$ be the unit circle in the complex plane, and
define $T\colon [0, 1) \to C$ by $T\omega = e^{2\pi i \omega}$. Let $\mathscr{B}$ consist of the Borel subsets of $[0, 1)$,
and let $\lambda$ be Lebesgue measure on $\mathscr{B}$. Show that $\mathscr{C} = [A\colon T^{-1}A \in \mathscr{B}]$ consists
of the sets in $\mathscr{R}^2$ (identify $R^2$ with the complex plane) that are contained in
$C$. Show that $\mathscr{C}$ is generated by the arcs of $C$. Circular Lebesgue measure is
defined as $\mu = \lambda T^{-1}$. Show that $\mu$ is invariant under rotations: $\mu[\theta z\colon z \in A] =
\mu(A)$ for $A \in \mathscr{C}$ and $\theta \in C$.


13.13. ↑ Suppose that the circular Lebesgue measure of $A$ satisfies $\mu(A) > 1 - n^{-1}$
and that $B$ contains at most $n$ points. Show that some rotation carries $B$ into
$A$: $\theta B \subset A$ for some $\theta$ in $C$.


13.14. Show by example that $\mu$ $\sigma$-finite does not imply $\mu T^{-1}$ $\sigma$-finite.


13.15. Consider Lebesgue measure $\lambda$ restricted to the class $\mathscr{B}$ of Borel sets in $(0, 1]$.
For a fixed permutation $n_1, n_2, \ldots$ of the positive integers, if $x$ has dyadic
expansion $.x_1 x_2 \ldots$, take $Tx = .x_{n_1} x_{n_2} \ldots$. Show that $T$ is measurable $\mathscr{B}/\mathscr{B}$
and that $\lambda T^{-1} = \lambda$.


13.16. Let $H_k$ be the union of the intervals $((i - 1)/2^k, i/2^k]$ for $i$ even, $1 \le i \le 2^k$.
Show that if $0 < f(\omega) \le 1$ for all $\omega$ and $A_k = f^{-1}(H_k)$, then $f(\omega) =
\sum_{k=1}^{\infty} I_{A_k}(\omega)/2^k$, an infinite linear combination of indicators.


13.17. Let $S = \{0, 1\}$, and define a map $T$ from sequence space $S^\infty$ to $[0, 1]$ by
$T\omega = \sum_{k=1}^{\infty} a_k(\omega)/2^k$. Define a map $U$ of $[0, 1]$ to $S^\infty$ by $Ux = (d_1(x), d_2(x), \ldots)$,
where the $d_k(x)$ are the digits of the nonterminating dyadic expansion of $x$
(and $d_k(0) = 0$). Show that $T$ is measurable $\mathscr{C}/\mathscr{B}$ and that $U$ is measurable
$\mathscr{B}/\mathscr{C}$. Let $P$ be the measure specified by (2.21) for $p_0 = p_1 = \frac{1}{2}$. Describe
$PT^{-1}$ and $\lambda U^{-1}$.


Distribution Functions


A random variable as defined in Section 13 is a measurable real function $X$
on a probability measure space $(\Omega, \mathscr{F}, P)$. The distribution or law of the
random variable is the probability measure $\mu$ on $(R^1, \mathscr{R}^1)$ defined by

(14.1) $\mu(A) = P[X \in A], \qquad A \in \mathscr{R}^1$.


As in the case of the simple random variables in Chapter 1, the argument $\omega$
is usually omitted: $P[X \in A]$ is short for $P[\omega\colon X(\omega) \in A]$. In the notation
(13.7), the distribution is $PX^{-1}$.

For simple random variables the distribution was defined in Section
5; see (5.12). There $\mu$ was defined for every subset of the line, however;
from now on $\mu$ will be defined only for Borel sets, because unless $X$ is
simple, one cannot in general be sure that $[X \in A]$ has a probability for $A$
outside $\mathscr{R}^1$.

The distribution function of $X$ is defined by


(14.2) $F(x) = \mu(-\infty, x] = P[X \le x]$

for real $x$. By continuity from above (Theorem 10.2(ii)) for $\mu$, $F$ is right-continuous.
Since $F$ is nondecreasing, the left-hand limit $F(x-) =
\lim_{y \uparrow x} F(y)$ exists, and by continuity from below (Theorem 10.2(i)) for $\mu$,

(14.3) $F(x-) = \mu(-\infty, x) = P[X < x]$.

Thus the jump or saltus in $F$ at $x$ is


$F(x) - F(x-) = \mu\{x\} = P[X = x]$.


Therefore (Theorem 10.2(iv)) $F$ can have at most countably many points of
discontinuity. Clearly,


(14.4) $\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1$.


A function with these properties must, in fact, be the distribution function 
of some random variable: 


Theorem 14.1. If $F$ is a nondecreasing, right-continuous function satisfying
(14.4), then there exists on some probability space a random variable $X$ for
which $F(x) = P[X \le x]$.


First Proof. By Theorem 12.4, if $F$ is nondecreasing and right-continuous,
there is on $(R^1, \mathscr{R}^1)$ a measure $\mu$ for which $\mu(a, b] = F(b) - F(a)$. But
$\lim_{x \to -\infty} F(x) = 0$ implies that $\mu(-\infty, x] = F(x)$, and $\lim_{x \to \infty} F(x) = 1$ implies
that $\mu(R^1) = 1$. For the probability space take $(\Omega, \mathscr{F}, P) = (R^1, \mathscr{R}^1, \mu)$,
and for $X$ take the identity function: $X(\omega) = \omega$. Then $P[X \le x] = \mu[\omega \in R^1\colon
\omega \le x] = F(x)$. ■

Second Proof. There is a proof that uses only the existence of Lebesgue
measure on the unit interval and does not require Theorem 12.4. For the
probability space take the open unit interval: $\Omega$ is $(0, 1)$, $\mathscr{F}$ consists of the
Borel subsets of $(0, 1)$, and $P(A)$ is the Lebesgue measure of $A$.

To understand the method, suppose at first that $F$ is continuous and
strictly increasing. Then $F$ is a one-to-one mapping of $R^1$ onto $(0, 1)$; let $\varphi\colon
(0, 1) \to R^1$ be the inverse mapping. For $0 < \omega < 1$, let $X(\omega) = \varphi(\omega)$. Since $\varphi$
is increasing, certainly $X$ is measurable $\mathscr{F}$. If $0 < u < 1$, then $\varphi(u) \le x$ if and
only if $u \le F(x)$. Since $P$ is Lebesgue measure, $P[X \le x] = P[\omega \in (0, 1)\colon
\varphi(\omega) \le x] = P[\omega \in (0, 1)\colon \omega \le F(x)] = F(x)$, as required.


[Figure: graph of $F(x)$ with the points $\varphi(u)$ and $\varphi(v)$ marked.]


If $F$ has discontinuities or is not strictly increasing, define†

(14.5) $\varphi(u) = \inf[x\colon u \le F(x)]$


for $0 < u < 1$. Since $F$ is nondecreasing, $[x\colon u \le F(x)]$ is an interval stretching
to $\infty$; since $F$ is right-continuous, this interval is closed on the left. For
$0 < u < 1$, therefore, $[x\colon u \le F(x)] = [\varphi(u), \infty)$, and so $\varphi(u) \le x$ if and only if
$u \le F(x)$. If $X(\omega) = \varphi(\omega)$ for $0 < \omega < 1$, then by the same reasoning as
before, $X$ is a random variable and $P[X \le x] = F(x)$. ■
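
The construction in the second proof can be tried out on a distribution function with jumps. The sketch below is illustrative only: the atoms and cumulative values are made-up data, and `F`, `phi`, and `sample` are hypothetical helper names. It computes $\varphi(u) = \inf[x\colon u \le F(x)]$ for a step function and draws values $X(\omega) = \varphi(\omega)$ with $\omega$ uniform on the unit interval.

```python
import bisect
import random

# Assumed discrete distribution: jumps at ATOMS, with F(ATOMS[i]) = CUM[i].
ATOMS = [0.0, 1.0, 3.0]
CUM = [0.2, 0.5, 1.0]

def F(x):
    """Right-continuous step distribution function."""
    i = bisect.bisect_right(ATOMS, x)      # number of atoms <= x
    return CUM[i - 1] if i > 0 else 0.0

def phi(u):
    """Quantile function (14.5): inf[x: u <= F(x)] for 0 < u < 1."""
    return ATOMS[bisect.bisect_left(CUM, u)]   # first atom whose F-value is >= u

def sample(n, seed=0):
    """Draw X(omega) = phi(omega) for omega uniform on the unit interval."""
    rng = random.Random(seed)
    return [phi(rng.random()) for _ in range(n)]
```

The key equivalence, $\varphi(u) \le x$ if and only if $u \le F(x)$, can be checked on a grid of `u` and `x` values, and the empirical frequency of `sample` values not exceeding `x` should be close to `F(x)`.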


This second argument actually provides a simple proof of Theorem 12.4
for a probability distribution‡ $F$: the distribution $\mu$ (as defined by (14.1)) of
the random variable just constructed satisfies $\mu(-\infty, x] = F(x)$ and hence
$\mu(a, b] = F(b) - F(a)$.


Exponential Distributions 




There are a number of results which for their interpretation require random
variables, independence, and other probabilistic concepts, but which can be
discussed technically in terms of distribution functions alone and do not
require the apparatus of measure theory.


†This is called the quantile function.
‡For the general case, see Problem 14.2.

Suppose as an example that $F$ is the distribution function of the waiting
time to the occurrence of some event, say the arrival of the next customer at
a queue or the next call at a telephone exchange. As the waiting time must be
positive, assume that $F(0) = 0$. Suppose that $F(x) < 1$ for all $x$, and furthermore
suppose that


(14.6) $\dfrac{1 - F(x + y)}{1 - F(x)} = 1 - F(y), \qquad x, y \ge 0$.


The right side of this equation is the probability that the waiting time exceeds 
y; by the definition (4.1) of conditional probability, the left side is the 
probability that the waiting time exceeds x + y given that it exceeds x. Thus 
(14.6) attributes to the waiting-time mechanism a kind of lack of memory or 
aftereffect: If after a lapse of x units of time the event has not yet occurred, 
the waiting time still remaining is conditionally distributed just as the entire 
waiting time from the beginning. For reasons that will emerge later (see 
Section 23), waiting times often have this property. 

The condition (14.6) completely determines the form of $F$. If $U(x) =
1 - F(x)$, (14.6) is $U(x + y) = U(x)U(y)$. This is a form of Cauchy's
equation [A20], and since $U$ is bounded, $U(x) = e^{-\alpha x}$ for some $\alpha$. Since
$\lim_{x \to \infty} U(x) = 0$, $\alpha$ must be positive. Thus (14.6) implies that $F$ has the
exponential form


(14.7) $F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - e^{-\alpha x} & \text{if } x \ge 0, \end{cases}$


and conversely.
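
The converse direction is a one-line computation, $e^{-\alpha(x+y)}/e^{-\alpha x} = e^{-\alpha y}$, and it can also be confirmed numerically. The following sketch assumes the exponential form (14.7); the function names and the test values are arbitrary illustrative choices.

```python
import math

def tail(x, alpha):
    """U(x) = 1 - F(x) = exp(-alpha x) for the exponential form (14.7), x >= 0."""
    return math.exp(-alpha * x)

def conditional_tail(x, y, alpha):
    """Left side of (14.6): probability of exceeding x + y given that x is exceeded."""
    return tail(x + y, alpha) / tail(x, alpha)

# Arbitrary test values: the conditional tail should match the unconditional one.
gap = abs(conditional_tail(2.0, 0.7, 1.3) - tail(0.7, 1.3))
```

The quantity `gap` is zero up to rounding error, whatever positive values are chosen for `x`, `y`, and `alpha`.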


Weak Convergence 


Random variables $X_1, \ldots, X_n$ are defined to be independent if the events
$[X_1 \in A_1], \ldots, [X_n \in A_n]$ are independent for all Borel sets $A_1, \ldots, A_n$, so
that $P[X_i \in A_i, i = 1, \ldots, n] = \prod_{i=1}^{n} P[X_i \in A_i]$. To find the distribution
function of the maximum $M_n = \max\{X_1, \ldots, X_n\}$, take $A_1 = \cdots = A_n = (-\infty, x]$.
This gives $P[M_n \le x] = \prod_{i=1}^{n} P[X_i \le x]$. If the $X_i$ are independent and have
common distribution function $G$ and $M_n$ has distribution function $F_n$, then


(14.8) $F_n(x) = G^n(x)$.


It is possible without any appeal to measure theory to study the
function $F_n$ solely by means of the relation (14.8), which can indeed be taken
as defining $F_n$. It is possible in particular to study the asymptotic properties
of $F_n$:

Example 14.1. Consider a stream or sequence of events, say arrivals of
calls at a telephone exchange. Suppose that the times between successive
events, the interarrival times, are independent and that each has the
exponential form (14.7) with a common value of $\alpha$. By (14.8) the maximum $M_n$
among the first $n$ interarrival times has distribution function $F_n(x) = (1 -
e^{-\alpha x})^n$, $x \ge 0$. For each $x$, $\lim_n F_n(x) = 0$, which means that $M_n$ tends to be
large for $n$ large. But $P[M_n - \alpha^{-1} \log n \le x] = F_n(x + \alpha^{-1} \log n)$. This is the
distribution function of $M_n - \alpha^{-1} \log n$, and it satisfies


(14.9) $F_n(x + \alpha^{-1} \log n) = \left(1 - e^{-(\alpha x + \log n)}\right)^n \to e^{-e^{-\alpha x}}$


as $n \to \infty$; the equality here holds if $\log n \ge -\alpha x$, and so the limit holds for
all $x$. This gives for large $n$ the approximate distribution of the normalized
random variable $M_n - \alpha^{-1} \log n$. ■
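
The limit in (14.9) is the elementary fact that $(1 - c/n)^n \to e^{-c}$ with $c = e^{-\alpha x}$, and the rate of approach can be seen numerically. The sketch below is an illustration under the assumptions of Example 14.1; the names `shifted_max_cdf` and `limit_cdf` are hypothetical.

```python
import math

def shifted_max_cdf(x, n, alpha=1.0):
    """F_n(x + log(n)/alpha) for the maximum of n exponential waiting times."""
    arg = x + math.log(n) / alpha
    if arg < 0:                      # F_n vanishes on the negative half-line
        return 0.0
    return (1.0 - math.exp(-alpha * arg)) ** n

def limit_cdf(x, alpha=1.0):
    """The limit in (14.9): exp(-exp(-alpha x))."""
    return math.exp(-math.exp(-alpha * x))

def error(x, n, alpha=1.0):
    return abs(shifted_max_cdf(x, n, alpha) - limit_cdf(x, alpha))
```

For fixed `x` the error shrinks roughly in proportion to $1/n$, so `error(0.5, 1000)` is far smaller than `error(0.5, 10)`.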


If $F_n$ and $F$ are distribution functions, then by definition, $F_n$ converges
weakly to $F$, written $F_n \Rightarrow F$, if


(14.10) $\lim_n F_n(x) = F(x)$


for each $x$ at which $F$ is continuous.† To study the approximate distribution
of a random variable $Y_n$ it is often necessary to study instead the normalized
or rescaled random variable $(Y_n - b_n)/a_n$ for appropriate constants $a_n$ and
$b_n$. If $Y_n$ has distribution function $F_n$ and if $a_n > 0$, then $P[(Y_n - b_n)/a_n \le x]
= P[Y_n \le a_n x + b_n]$, and therefore $(Y_n - b_n)/a_n$ has distribution function
$F_n(a_n x + b_n)$. For this reason weak convergence often appears in the form‡


(14.11) $F_n(a_n x + b_n) \Rightarrow F(x)$.


An example of this is (14.9): there $a_n = 1$, $b_n = \alpha^{-1} \log n$, and $F(x) = e^{-e^{-\alpha x}}$.


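Example 14.2. Consider again the distribution function (14.8) of the
maximum, but suppose that $G$ has the form


$G(x) = \begin{cases} 0 & \text{if } x < 1, \\ 1 - x^{-\alpha} & \text{if } x \ge 1, \end{cases}$


where $\alpha > 0$. Here $F_n(n^{1/\alpha}x) = (1 - n^{-1}x^{-\alpha})^n$ for $x \ge n^{-1/\alpha}$, and therefore


$\lim_n F_n(n^{1/\alpha}x) = \begin{cases} 0 & \text{if } x \le 0, \\ e^{-x^{-\alpha}} & \text{if } x > 0. \end{cases}$


This is an example of (14.11) in which $a_n = n^{1/\alpha}$ and $b_n = 0$. ■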


†For the role of continuity, see Example 14.4.

‡To write $F_n(a_n x + b_n) \Rightarrow F(x)$ ignores the distinction between a function and its value at an
unspecified value of its argument, but the meaning of course is that $F_n(a_n x + b_n) \to F(x)$ at
continuity points $x$ of $F$.

Example 14.3. Consider (14.8) once more, but for


$G(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - (1 - x)^{\alpha} & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x \ge 1, \end{cases}$


where $\alpha > 0$. This time $F_n(n^{-1/\alpha}x + 1) = (1 - n^{-1}(-x)^{\alpha})^n$ if $-n^{1/\alpha} \le x \le 0$.
Therefore,


$\lim_n F_n(n^{-1/\alpha}x + 1) = \begin{cases} e^{-(-x)^{\alpha}} & \text{if } x \le 0, \\ 1 & \text{if } x > 0, \end{cases}$


a case of (14.11) in which $a_n = n^{-1/\alpha}$ and $b_n = 1$. ■
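
Examples 14.2 and 14.3 both reduce to evaluating $G^n(a_n x + b_n)$ and comparing with the stated limit, which is easy to do numerically. The sketch below is an illustration only; the function names are hypothetical and $\alpha = 2$ is an arbitrary choice.

```python
import math

def G_pareto(x, alpha):
    """G of Example 14.2: 1 - x^(-alpha) for x >= 1, else 0."""
    return 0.0 if x < 1 else 1.0 - x ** (-alpha)

def G_bounded(x, alpha):
    """G of Example 14.3: 1 - (1 - x)^alpha on [0, 1]."""
    if x < 0:
        return 0.0
    if x >= 1:
        return 1.0
    return 1.0 - (1.0 - x) ** alpha

def normalized_max_cdf(G, alpha, n, a_n, b_n, x):
    """F_n(a_n x + b_n) = G(a_n x + b_n)^n, by (14.8)."""
    return G(a_n * x + b_n, alpha) ** n

alpha, n = 2.0, 10**6
# Example 14.2: a_n = n^(1/alpha), b_n = 0, limit exp(-x^(-alpha)) for x > 0.
gap_14_2 = abs(normalized_max_cdf(G_pareto, alpha, n, n ** (1 / alpha), 0.0, 1.5)
               - math.exp(-1.5 ** (-alpha)))
# Example 14.3: a_n = n^(-1/alpha), b_n = 1, limit exp(-(-x)^alpha) for x <= 0.
gap_14_3 = abs(normalized_max_cdf(G_bounded, alpha, n, n ** (-1 / alpha), 1.0, -0.7)
               - math.exp(-(0.7 ** alpha)))
```

Both gaps are small for large `n`, in line with the two limits displayed in the examples.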


Let $\Delta$ be the distribution function with a unit jump at the origin:


(14.12) $\Delta(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 & \text{if } x \ge 0. \end{cases}$


If $X(\omega) \equiv 0$, then $X$ has distribution function $\Delta$.


Example 14.4. Let $X_1, X_2, \ldots$ be independent random variables for which
$P[X_k = 1] = P[X_k = -1] = \frac{1}{2}$, and put $S_n = X_1 + \cdots + X_n$. By the weak law
of large numbers,


(14.13) $P[|n^{-1}S_n| > \varepsilon] \to 0$


for $\varepsilon > 0$. Let $F_n$ be the distribution function of $n^{-1}S_n$. If $x > 0$, then
$F_n(x) = 1 - P[n^{-1}S_n > x] \to 1$; if $x < 0$, then $F_n(x) \le P[|n^{-1}S_n| \ge |x|] \to 0$. As
this accounts for all the continuity points of $\Delta$, $F_n \Rightarrow \Delta$. It is easy to turn the
argument around and deduce (14.13) from $F_n \Rightarrow \Delta$. Thus the weak law of
large numbers is equivalent to the assertion that the distribution function of
$n^{-1}S_n$ converges weakly to $\Delta$.

If $n$ is odd, so that $S_n = 0$ is impossible, then by symmetry the events
$[S_n \le 0]$ and $[S_n \ge 0]$ each have probability $\frac{1}{2}$ and hence $F_n(0) = \frac{1}{2}$. Thus $F_n(0)$
does not converge to $\Delta(0) = 1$, but because $\Delta$ is discontinuous at 0, the
definition of weak convergence does not require this. ■
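
The behavior at the discontinuity point can be computed exactly from the binomial distribution. The following sketch is illustrative, with `F_n` a hypothetical helper: it writes $S_n = 2H - n$, where $H$ counts the $+1$ steps, and sums binomial probabilities in exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def F_n(x, n):
    """P[S_n / n <= x] for S_n a sum of n independent +1/-1 steps, each with probability 1/2."""
    # S_n = 2h - n when exactly h of the steps equal +1.
    favorable = sum(comb(n, h) for h in range(n + 1)
                    if Fraction(2 * h - n, n) <= x)
    return Fraction(favorable, 2 ** n)

at_zero = F_n(0, 101)                    # n odd: exactly 1/2
right = F_n(Fraction(1, 10), 1001)       # close to Delta(1/10) = 1
left = F_n(Fraction(-1, 10), 1001)       # close to Delta(-1/10) = 0
```

So $F_n(0)$ stays at $\frac{1}{2}$ along odd $n$, while at the continuity points $\pm\frac{1}{10}$ of $\Delta$ the values approach 1 and 0.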


Requiring (14.10) only at continuity points of $F$ thus makes it
possible to bring the weak law of large numbers under the theory of weak
convergence. But if (14.10) need hold only for certain values of $x$, there
arises the question of whether weak limits are unique. Suppose that $F_n \Rightarrow F$
and $F_n \Rightarrow G$. Then $F(x) = \lim_n F_n(x) = G(x)$ if $F$ and $G$ are both continuous
at $x$. Since $F$ and $G$ each have only countably many points of discontinuity,†
the set of common continuity points is dense, and it follows by right
continuity that $F$ and $G$ are identical. A sequence can thus have at most one
weak limit.

Convergence of distribution functions is studied in detail in Chapter 5. 
The remainder of this section is devoted to some weak-convergence theorems 
which are interesting both for themselves and for the reason that they require 
so little technical machinery. 


Convergence of Types*


Distribution functions $F$ and $G$ are of the same type if there exist constants
$a$ and $b$, $a > 0$, such that $F(ax + b) = G(x)$ for all $x$. A distribution function
is degenerate if it has the form $\Delta(x - x_0)$ (see (14.12)) for some $x_0$; otherwise,
it is nondegenerate.


Theorem 14.2. Suppose that $F_n(u_n x + v_n) \Rightarrow F(x)$ and $F_n(a_n x + b_n) \Rightarrow
G(x)$, where $u_n > 0$, $a_n > 0$, and $F$ and $G$ are nondegenerate. Then there exist $a$
and $b$, $a > 0$, such that $a_n/u_n \to a$, $(b_n - v_n)/u_n \to b$, and $F(ax + b) = G(x)$.


Thus there can be only one possible limit type and essentially only one 
possible sequence of norming constants. 

The proof of the theorem is for clarity set out in a sequence of lemmas. In 
all of them, a and the a, are assumed to be positive. 


Lemma 1. If $F_n \Rightarrow F$, $a_n \to a$, and $b_n \to b$, then $F_n(a_n x + b_n) \Rightarrow F(ax + b)$.


Proof. If $x$ is a continuity point of $F(ax + b)$ and $\varepsilon > 0$, choose continuity
points $u$ and $v$ of $F$ so that $u < ax + b < v$ and $F(v) - F(u) < \varepsilon$;
this is possible because $F$ has only countably many discontinuities. For large
enough $n$, $u < a_n x + b_n < v$, $|F_n(u) - F(u)| < \varepsilon$, and $|F_n(v) - F(v)| < \varepsilon$; but
then $F(ax + b) - 2\varepsilon < F(u) - \varepsilon < F_n(u) \le F_n(a_n x + b_n) \le F_n(v) < F(v) + \varepsilon <
F(ax + b) + 2\varepsilon$. ■


Lemma 2. If $F_n \Rightarrow F$ and $a_n \to \infty$, then $F_n(a_n x) \Rightarrow \Delta(x)$.


Proof. Given $\varepsilon$, choose a continuity point $u$ of $F$ so large that $F(u) >
1 - \varepsilon$. If $x > 0$, then for all large enough $n$, $a_n x > u$ and $|F_n(u) - F(u)| < \varepsilon$, so
that $F_n(a_n x) \ge F_n(u) > F(u) - \varepsilon > 1 - 2\varepsilon$. Thus $\lim_n F_n(a_n x) = 1$ for $x > 0$;
similarly, $\lim_n F_n(a_n x) = 0$ for $x < 0$. ■


†The proof following (14.3) uses measure theory, but this is not necessary: If the saltus
$\sigma(x) = F(x) - F(x-)$ exceeds $\varepsilon$ at $x_1 < \cdots < x_n$, then $F(x_i) - F(x_{i-1}) > \varepsilon$ (take $x_0 < x_1$), and
so $n\varepsilon < F(x_n) - F(x_0) \le 1$; hence $[x\colon \sigma(x) > \varepsilon]$ is finite and $[x\colon \sigma(x) > 0]$ is countable.
*This topic may be omitted.


Lemma 3. If $F_n \Rightarrow F$ and $b_n$ is unbounded, then $F_n(x + b_n)$ cannot converge
weakly.


Proof. Suppose that $b_n$ is unbounded and that $b_n \to \infty$ along some
subsequence (the case $b_n \to -\infty$ is similar). Suppose that $F_n(x + b_n) \Rightarrow G(x)$.
Given $\varepsilon$, choose a continuity point $u$ of $F$ so that $F(u) > 1 - \varepsilon$. Whatever $x$
may be, for $n$ far enough out in the subsequence, $x + b_n > u$ and $F_n(u) > 1 -
2\varepsilon$, so that $F_n(x + b_n) > 1 - 2\varepsilon$. Thus $G(x) = \lim_n F_n(x + b_n) = 1$ for all
continuity points $x$ of $G$, which is impossible. ■


Lemma 4. If $F_n(x) \Rightarrow F(x)$ and $F_n(a_n x + b_n) \Rightarrow G(x)$, where $F$ and $G$ are
nondegenerate, then

(14.14) $0 < \inf_n a_n \le \sup_n a_n < \infty, \qquad \sup_n |b_n| < \infty$.


Proof. Suppose that $a_n$ is not bounded above. Arrange by passing to a
subsequence that $a_n \uparrow \infty$. Then by Lemma 2,

(14.15) $F_n(a_n x) \Rightarrow \Delta(x)$.

Since

(14.16) $F_n(a_n(x + b_n/a_n)) = F_n(a_n x + b_n) \Rightarrow G(x)$,

it follows by Lemma 3 that $b_n/a_n$ is bounded along this subsequence. By
passing to a further subsequence, arrange that $b_n/a_n$ converges to some $c$. By
(14.15) and Lemma 1, $F_n(a_n(x + b_n/a_n)) \Rightarrow \Delta(x + c)$ along this subsequence.
But (14.16) now implies that $G$ is degenerate, contrary to hypothesis.

Thus $a_n$ is bounded above. If $G_n(x) = F_n(a_n x + b_n)$, then $G_n(x) \Rightarrow G(x)$
and $G_n(a_n^{-1}x - a_n^{-1}b_n) = F_n(x) \Rightarrow F(x)$. The result just proved shows that $a_n^{-1}$
is bounded.

Thus $a_n$ is bounded away from 0 and $\infty$. If $b_n$ is not bounded, neither is
$b_n/a_n$; pass to a subsequence along which $b_n/a_n \to \pm\infty$ and $a_n$ converges to
a positive $a$. Since, by Lemma 1, $F_n(a_n x) \Rightarrow F(ax)$ along the subsequence,
(14.16) and $b_n/a_n \to \pm\infty$ stand in contradiction (Lemma 3 again). Therefore
$b_n$ is bounded. ■


Lemma 5. If $F(x) = F(ax + b)$ for all $x$ and $F$ is nondegenerate, then
$a = 1$ and $b = 0$.


Proof. Since $F(x) = F(a^n x + (a^{n-1} + \cdots + a + 1)b)$, it follows by
Lemma 4 that $a^n$ is bounded away from 0 and $\infty$, so that $a = 1$, and it then
follows that $nb$ is bounded, so that $b = 0$. ■

Proof of Theorem 14.2. Suppose first that $u_n \equiv 1$ and $v_n \equiv 0$. Then
(14.14) holds. Fix any subsequence along which $a_n$ converges to some positive
$a$ and $b_n$ converges to some $b$. By Lemma 1, $F_n(a_n x + b_n) \Rightarrow F(ax + b)$ along
this subsequence, and the hypothesis gives $F(ax + b) = G(x)$.

Suppose that along some other sequence, $a_n \to u > 0$ and $b_n \to v$. Then
$F(ux + v) = G(x)$ and $F(ax + b) = G(x)$ both hold, so that $u = a$ and $v = b$
by Lemma 5. Every convergent subsequence of $\{(a_n, b_n)\}$ thus converges to
$(a, b)$, and so the entire sequence does.

For the general case, let $H_n(x) = F_n(u_n x + v_n)$. Then $H_n(x) \Rightarrow F(x)$ and
$H_n(a_n u_n^{-1} x + (b_n - v_n)u_n^{-1}) \Rightarrow G(x)$, and so by the case already treated, $a_n u_n^{-1}$
converges to some positive $a$ and $(b_n - v_n)u_n^{-1}$ to some $b$, and as before,
$F(ax + b) = G(x)$. ■


Extremal Distributions* 


A distribution function $F$ is extremal if it is nondegenerate and if, for some
distribution function $G$ and constants $a_n$ ($a_n > 0$) and $b_n$,


(14.17) $G^n(a_n x + b_n) \Rightarrow F(x)$.


These are the possible limiting distributions of normalized maxima (see (14.8)), and
Examples 14.1, 14.2, and 14.3 give three specimens. The following analysis shows that
these three examples exhaust the possible types.

Assume that $F$ is extremal. From (14.17) follow $G^{nk}(a_n x + b_n) \Rightarrow F^k(x)$ and
$G^{nk}(a_{nk} x + b_{nk}) \Rightarrow F(x)$, and so by Theorem 14.2 there exist constants $c_k$ and $d_k$
such that $c_k$ is positive and


(14.18) $F^k(x) = F(c_k x + d_k)$.


From $F(c_{jk} x + d_{jk}) = F^{jk}(x) = F^j(c_k x + d_k) = F(c_j(c_k x + d_k) + d_j)$ follow (Lemma
5) the relations


(14.19) $c_{jk} = c_j c_k, \qquad d_{jk} = c_j d_k + d_j = c_k d_j + d_k$.


Of course, $c_1 = 1$ and $d_1 = 0$. There are three cases to be considered separately.


Case 1. Suppose that $c_k = 1$ for all $k$. Then

(14.20) $F^k(x) = F(x + d_k), \qquad F^{1/k}(x) = F(x - d_k)$.


This implies that $F^{j/k}(x) = F(x + d_j - d_k)$. For positive rational $r = j/k$, put $\delta_r =
d_j - d_k$; (14.19) implies that the definition is consistent, and $F^r(x) = F(x + \delta_r)$. Since $F$
is nondegenerate, there is an $x$ such that $0 < F(x) < 1$, and it follows by (14.20) that
$d_k$ is decreasing in $k$, so that $\delta_r$ is strictly decreasing in $r$.

For positive real $t$ let $\varphi(t) = \inf_{0 < r < t} \delta_r$ ($r$ rational in the infimum). Then $\varphi(t)$ is
decreasing in $t$, and


(14.21) $F^t(x) = F(x + \varphi(t))$


for all $x$ and all positive $t$. Further, (14.19) implies that $\varphi(st) = \varphi(s) + \varphi(t)$, so that by
the theorem on Cauchy's equation [A20] applied to $\varphi(e^x)$, $\varphi(t) = -\beta \log t$, where
$\beta > 0$ because $\varphi(t)$ is strictly decreasing. Now (14.21) with $t = e^{x/\beta}$ gives $F(x) =
\exp\{e^{-x/\beta} \log F(0)\}$, and so $F$ must be of the same type as


(14.22) $F_1(x) = e^{-e^{-x}}$.


Example 14.1 shows that this distribution function can arise as a limit of distributions
of maxima, that is, $F_1$ is indeed extremal.


*This topic may be omitted.


Case 2. Suppose that $c_{k_0} \ne 1$ for some $k_0$, which necessarily exceeds 1. Then there
exists an $x'$ such that $c_{k_0} x' + d_{k_0} = x'$; but (14.18) then gives $F^{k_0}(x') = F(x')$, so that
$F(x')$ is 0 or 1. (In Case 1, $F$ has the type (14.22) and so never assumes the values 0
and 1.)

Now suppose further that, in fact, $F(x') = 0$. Let $x_0$ be the supremum of those $x$
for which $F(x) = 0$. By passing to a new $F$ of the same type one can arrange that
$x_0 = 0$; then $F(x) = 0$ for $x < 0$ and $F(x) > 0$ for $x > 0$. The new $F$ will satisfy (14.18),
but with new constants $d_k$.

If a (new) $d_k$ is distinct from 0, then there is an $x$ near 0 for which the arguments
on the two sides of (14.18) have opposite signs. Therefore, $d_k = 0$ for all $k$, and


(14.23) $F^k(x) = F(c_k x), \qquad F^{1/k}(x) = F(x/c_k)$


for all $k$ and $x$. This implies that $F^{j/k}(x) = F(x c_j/c_k)$. For positive rational $r = j/k$,
put $\gamma_r = c_j/c_k$. The definition is again consistent by (14.19), and $F^r(x) = F(\gamma_r x)$.
Since $0 < F(x) < 1$ for some $x$, necessarily positive, it follows by (14.23) that $c_k$ is
decreasing in $k$, so that $\gamma_r$ is strictly decreasing in $r$. Put $\psi(t) = \inf_{0 < r < t} \gamma_r$ for
positive real $t$. From (14.19) follows $\psi(st) = \psi(s)\psi(t)$, and by the corollary to the
theorem on Cauchy's equation [A20] applied to $\psi(e^x)$, it follows that $\psi(t) = t^{-\xi}$ for
some $\xi > 0$. Since $F^t(x) = F(\psi(t)x)$ for all $x$ and for $t$ positive, $F(x) =
\exp\{x^{-1/\xi} \log F(1)\}$ for $x > 0$. Thus (take $\alpha = 1/\xi$) $F$ is of the same type as


(14.24) $F_{2,\alpha}(x) = \begin{cases} 0 & \text{if } x \le 0, \\ e^{-x^{-\alpha}} & \text{if } x > 0. \end{cases}$


Example 14.2 shows that this case can arise.


Case 3. Suppose as in Case 2 that $c_{k_0} \neq 1$ for some $k_0$, so that $F(x')$ is 0 or 1 for 
some $x'$, but this time suppose that $F(x') = 1$. Let $x_1$ be the infimum of those x for 
which $F(x) = 1$. By passing to a new F of the same type, arrange that $x_1 = 0$; then 
$F(x) < 1$ for $x < 0$ and $F(x) = 1$ for $x \ge 0$. If $d_k \neq 0$, then for some x near 0, one side 
of (14.18) is 1 and the other is not. Thus $d_k = 0$ for all k, and (14.23) again holds. And 
again $\gamma_{j/k} = c_j/c_k$ consistently defines a function satisfying $F^r(x) = F(\gamma_r x)$. Since F 
is nondegenerate, $0 < F(x) < 1$ for some x, but this time x is necessarily negative, so 
that $c_k$ is increasing. 


The same analysis as before shows that there is a positive $\xi$ such that $F^t(x) = 
F(t^{\xi}x)$ for all x and for t positive. Thus $F(x) = \exp\{(-x)^{1/\xi} \log F(-1)\}$ for $x \le 0$, 
and (taking $\alpha = 1/\xi$) F is of the type 


(14.25)    $F_{3,\alpha}(x) = \begin{cases} e^{-(-x)^{\alpha}} & \text{if } x \le 0, \\ 1 & \text{if } x > 0. \end{cases}$ 


Example 14.3 shows that this distribution function is indeed extremal. 


This completely characterizes the class of extremal distributions: 


Theorem 14.3. The class of extremal distribution functions consists exactly of the 
distribution functions of the types (14.22), (14.24), and (14.25). 
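
As a numerical sanity check (a sketch, not part of the proof), one can verify that each of the three types satisfies the stability relation $F^k(x) = F(c_kx + d_k)$ of (14.18). The sketch assumes that (14.22) is the double-exponential $F_1(x) = \exp(-e^{-x})$; the constants $c_k$, $d_k$ below are obtained by direct computation from the three formulas, and all function names are illustrative.

```python
import math

def gumbel(x):
    """Assumed form of type (14.22): F_1(x) = exp(-exp(-x))."""
    return math.exp(-math.exp(-x))

def frechet(x, alpha):
    """Type (14.24): 0 for x <= 0, exp(-x**(-alpha)) for x > 0."""
    return 0.0 if x <= 0 else math.exp(-x ** (-alpha))

def weibull(x, alpha):
    """Type (14.25): exp(-(-x)**alpha) for x <= 0, 1 for x > 0."""
    return math.exp(-((-x) ** alpha)) if x <= 0 else 1.0

def stability_gap(F, c, d, ks, xs):
    """Largest |F(x)**k - F(c(k)*x + d(k))| over the test points."""
    return max(abs(F(x) ** k - F(c(k) * x + d(k))) for k in ks for x in xs)

alpha = 1.5                                # arbitrary positive exponent
ks = range(1, 8)                           # powers k = 1, ..., 7
xs = [i / 4 for i in range(-20, 21)]       # test points in [-5, 5]

# Case 1: c_k = 1 and d_k = -log k.
gap_gumbel = stability_gap(gumbel, lambda k: 1.0, lambda k: -math.log(k), ks, xs)
# Case 2: d_k = 0 and c_k = k**(-1/alpha), decreasing in k.
gap_frechet = stability_gap(lambda x: frechet(x, alpha),
                            lambda k: k ** (-1 / alpha), lambda k: 0.0, ks, xs)
# Case 3: d_k = 0 and c_k = k**(1/alpha), increasing in k.
gap_weibull = stability_gap(lambda x: weibull(x, alpha),
                            lambda k: k ** (1 / alpha), lambda k: 0.0, ks, xs)

print(gap_gumbel, gap_frechet, gap_weibull)
```

All three gaps should be at the level of floating-point rounding. The monotonicity of $c_k$ in the second and third cases matches the conclusions drawn in Cases 2 and 3.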


It is possible to go on and characterize the domains of attraction. That is, it is 
possible for each extremal distribution function F to describe the class of G satisfying 
(14.17) for some constants $a_n$ and $b_n$, that is, the class of G attracted to F.† 


PROBLEMS 


14.1. The general nondecreasing function F has at most countably many discontinu- 
ities. Prove this by considering the open intervals 


$\Bigl(\sup_{u<x} F(u),\ \inf_{v>x} F(v)\Bigr)$ 


—each nonempty one contains a rational. 


14.2. For distribution functions F, the second proof of Theorem 14.1 shows how to 
construct a measure $\mu$ on $(R^1, \mathscr{R}^1)$ such that $\mu(a, b] = F(b) - F(a)$. 
(a) Extend to the case of bounded F. 
(b) Extend to the general case. Hint: Let $F_n(x)$ be $-n$ or $F(x)$ or $n$ as 
$F(x) < -n$ or $-n \le F(x) \le n$ or $n < F(x)$. Construct the corresponding $\mu_n$ and 
define $\mu(A) = \lim_n \mu_n(A)$. 


14.3. (a) Suppose that X has a continuous, strictly increasing distribution function 
F. Show that the random variable F(X) is uniformly distributed over the unit 
interval in the sense that $P[F(X) \le u] = u$ for $0 \le u \le 1$. Passing from X to 
F(X) is called the probability transformation. 
(b) Show that the function $\varphi(u)$ defined by (14.5) satisfies $F(\varphi(u)-) \le 
u \le F(\varphi(u))$ and that, if F is continuous (but not necessarily strictly increasing), 
then $F(\varphi(u)) = u$ for $0 < u < 1$. 
(c) Show that $P[F(X) < u] = F(\varphi(u)-)$ and hence that the result in part (a) 
holds as long as F is continuous. 
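
The inequalities in parts (b) and (c) can be checked exactly on a small discrete example. The sketch below assumes that (14.5) defines $\varphi(u) = \inf[x: u \le F(x)]$; the three-atom distribution and all names are hypothetical, chosen only so that F has jumps.

```python
from fractions import Fraction

# Hypothetical discrete law: atoms at 0, 1, 3 with masses 1/4, 1/2, 1/4.
ATOMS = [(0, Fraction(1, 4)), (1, Fraction(1, 2)), (3, Fraction(1, 4))]

def F(x):
    """Distribution function F(x) = P[X <= x]."""
    return sum(p for a, p in ATOMS if a <= x)

def F_left(x):
    """Left limit F(x-) = P[X < x]."""
    return sum(p for a, p in ATOMS if a < x)

def phi(u):
    """Assumed form of (14.5): inf[x: u <= F(x)], attained at an atom for 0 < u <= 1."""
    return min(a for a, _ in ATOMS if u <= F(a))

def prob_F_of_X_below(u):
    """P[F(X) < u], summing the masses of atoms a with F(a) < u."""
    return sum(p for a, p in ATOMS if F(a) < u)

us = [Fraction(i, 20) for i in range(1, 20)]

# Part (b): F(phi(u)-) <= u <= F(phi(u)).
sandwich_holds = all(F_left(phi(u)) <= u <= F(phi(u)) for u in us)
# Part (c): P[F(X) < u] = F(phi(u)-).
part_c_holds = all(prob_F_of_X_below(u) == F_left(phi(u)) for u in us)

print(sandwich_holds, part_c_holds)
```

Because this F has jumps, $F(\varphi(u)) = u$ fails for most u (for example, $F(\varphi(1/2)) = 3/4$), which is why continuity of F is needed for the uniformity in part (a).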


"This theory is associated with the names of Fisher, Fréchet, Gnedenko, and 
information, see GALAMBOS. 
14.4. ↑ Let C be the set of continuity points of F. 

(a) Show that for every Borel set A, $P[F(X) \in A, X \in C]$ is at most the 
Lebesgue measure of A. 

(b) Show that if F is continuous at each point of $F^{-1}A$, then 
$P[F(X) \in A]$ is at most the Lebesgue measure of A. 


14.5. The Lévy distance d(F, G) between two distribution functions is the infimum of 
those $\epsilon$ such that $G(x - \epsilon) - \epsilon \le F(x) \le G(x + \epsilon) + \epsilon$ for all x. Verify that this 
is a metric on the set of distribution functions. Show that a necessary and 
sufficient condition for $F_n \Rightarrow F$ is that $d(F_n, F) \to 0$. 
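
To get a feel for the definition, the Lévy distance can be approximated numerically. The sketch below checks the defining inequalities only on a finite grid of points, so it is an approximation to the condition "for all x"; it relies on the fact that if some $\epsilon$ is admissible then so is every larger one, which allows bisection. The function names and the two uniform distributions are illustrative assumptions.

```python
def levy_distance(F, G, xs, tol=1e-12):
    """Approximate d(F, G) by bisection, testing the inequalities on the grid xs."""
    def admissible(eps):
        return all(G(x - eps) - eps <= F(x) <= G(x + eps) + eps for x in xs)

    if admissible(0.0):
        return 0.0
    lo, hi = 0.0, 1.0      # eps = 1 is always admissible since 0 <= F, G <= 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if admissible(mid):
            hi = mid
        else:
            lo = mid
    return hi

def uniform_cdf(left):
    """Distribution function of the uniform law on [left, left + 1]."""
    return lambda x: min(1.0, max(0.0, x - left))

F = uniform_cdf(0.0)
G = uniform_cdf(0.2)
xs = [i / 100 for i in range(-100, 221)]   # grid covering [-1, 2.2]

d_FG = levy_distance(F, G, xs)
d_GF = levy_distance(G, F, xs)
print(d_FG, d_GF)
```

For these two uniform laws a direct calculation gives distance 0.1, half the shift, and the two orderings of the arguments agree, as symmetry of a metric requires.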


14.6. 12.3↑ A Borel function satisfying Cauchy’s equation [A20] is automatically 
bounded in some interval and hence satisfies $f(x) = xf(1)$. Hint: Take K large 
enough that $\lambda[x: x > 0, |f(x)| \le K] > 0$. Apply Problem 12.3 and conclude that 
f is bounded in some interval to the right of 0. 


14.7. ↑ Consider sets S of reals that are linearly independent over the field of 
rationals in the sense that $n_1x_1 + \cdots + n_kx_k = 0$ for distinct points $x_i$ in S and 
integers $n_i$ (positive or negative) is impossible unless all $n_i = 0$. 

(a) By Zorn’s lemma find a maximal such S. Show that it is a Hamel basis. That 
is, show that each real x can be written uniquely as $x = r_1x_1 + \cdots + r_kx_k$ for 
distinct points $x_i$ in S and rationals $r_i$. 

(b) Define f arbitrarily on S, and define it elsewhere by $f(r_1x_1 + \cdots + r_kx_k) 
= r_1f(x_1) + \cdots + r_kf(x_k)$. Show that f satisfies Cauchy’s equation but need 
not satisfy $f(x) = xf(1)$. 

(c) By means of Problem 14.6 give a new construction of a nonmeasurable 
function and a nonmeasurable set. 


14.8. 14.5↑ (a) Show that if a distribution function F is everywhere continuous, 
then it is uniformly continuous. 

(b) Let $\delta_F(\epsilon) = \sup[F(x) - F(y): |x - y| \le \epsilon]$ be the modulus of continuity of 
F. Show that $d(F, G) < \epsilon$ implies that $\sup_x |F(x) - G(x)| \le \epsilon + \delta_F(\epsilon)$. 

(c) Show that, if $F_n \Rightarrow F$ and F is everywhere continuous, then $F_n(x) \to F(x)$ 
uniformly in x. What if F is continuous over a closed interval? 


14.9. Show that (14.24) and (14.25) are everywhere infinitely differentiable, although 
not analytic. 


